ON REGULARIZATION FOR EXPLAINING GRAPH NEURAL NETWORKS: AN INFORMATION THEORY PERSPECTIVE

Abstract

This work studies the explainability of graph neural networks (GNNs), which is important for the credibility of GNNs in practical usage. Existing work mostly follows a two-phase paradigm to interpret a prediction: feature attribution and selection. However, another important component, regularization, which is crucial to facilitate the above paradigm, has been seldom studied. In this work, we explore the role of regularization in GNNs explainability from the perspective of information theory. Our main findings are: 1) regularization essentially pursues a balance between the two phases; 2) its optimal coefficient is proportional to the sparsity of explanations; 3) existing methods imply an implicit regularization effect of the stochastic mechanism; and 4) its contradictory effects on the two phases are responsible for the out-of-distribution (OOD) issue in post-hoc explainability. Based on these findings, we propose two common optimization methods, which can bolster the performance of current explanation methods via sparsity-adaptive and OOD-resistant regularization schemes. Extensive empirical studies validate our findings and proposed methods. Code is available at https://anonymous.4open.science/r/Rethink_Reg-07F0.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Dwivedi et al., 2020; Wu et al., 2019) have achieved remarkable progress on various graph-related tasks (Mahmud et al., 2021; Zhao et al., 2021; Guo & Wang, 2021). However, GNNs usually work as a black box, making the decision-making process obscure and hard to interpret (Ribeiro et al., 2016). Hence, answering the question "What knowledge does the GNN use to make a certain prediction?" is becoming crucial. To answer it, most prior studies (Yu et al., 2021; Miao et al., 2022) realize post-hoc explainability by extracting informative yet sparse subgraphs as explanations, following the principle of the graph information bottleneck (GIB) (Wu et al., 2020). The common paradigm of these explainers can be summarized as a relay race of feature attribution and selection. Specifically, feature attribution distributes the prediction to the input features and traces their importance, and feature selection sequentially fills features into the explanatory subgraph according to the importance rank, where regularization terms are introduced to constrain subgraph properties like size and connectivity. However, existing explainers allocate little attention to the role of regularization, which is the focus of our work. On the one hand, without digging deeper into regularization theoretically, we can hardly acquire a clear picture of how regularization specifically affects the process of feature attribution and selection. On the other hand, some regularization in existing explainers lacks concrete theoretical support and seems to be little more than an empirical trick. For example, GNNExplainer (Ying et al., 2019) leverages the l1 norm to constrain the magnitude of masks and selects the edges with larger importance (i.e., larger masks). The key here is not the absolute magnitude of the masks (i.e., the l1 norm), but rather their relative magnitude. Thus, we argue that the necessity of the l1 norm needs more theoretical support.
In light of this, we endeavor to rethink the role of regularization in GNNs explainability from the perspective of information theory. Before starting, we first reshape the principle of GIB as GIBE (i.e., a new GIB form tailored for GNNs Explainability) in the language of feature attribution and selection. Specifically, GIBE unifies the current explanation methods by formulating the optimization objective of these two phases, and further explores the role of regularization in each phase. Guided by these explorations, we reveal the essence of regularization and propose four intriguing propositions. We believe a better theory of regularization is fundamental:
• The essence of regularization: Regularization in GNNs explainability is essentially a tradeoff scheme that pursues a balance between the phases of feature attribution and selection (Section 3.2).
• On sparsity: The optimal coefficients of regularization are proportional to the sparsity of the explanation; that is, high sparsity requires large regularization and vice versa (Section 4.1).
• On the stochastic mechanism: Existing methods imply an implicit regularization effect of the stochastic mechanism, which endows GNNs explainability with better compressibility (Section 4.2).
• On the OOD issue: The contradictory effects of regularization on the two phases are responsible for the OOD issue in post-hoc explainability (Section 4.3).
Furthermore, based on these findings, we propose two common optimization methods, which can bolster the performance of current explainers via sparsity-adaptive and OOD-resistant regularization schemes. Extensive empirical studies validate our findings and proposed methods in Section 5.

2. PRELIMINARY AND RELATED WORK

GNNs explainability. While GNNs have achieved remarkable success in node classification (Zhou et al., 2019; Sankar et al., 2019), graph classification (Zhang et al., 2018; Chen et al., 2019), and link prediction (Ying et al., 2018; You et al., 2020) tasks, in this work we focus on the scenario of interpreting a graph classification task comprising the data distribution D and the classifier f′. Specifically, the input graph G = (X, A) is independent and identically distributed (IID) from D, where X is the feature matrix of all nodes and A is the adjacency matrix. Following Miao et al. (2022), we assume that there exists a subgraph G* such that the label Y for graph G is determined by Y = f(G*) + ϵ for some noise ϵ independent of G, where f is an invertible function projecting the set of subgraphs to the label space. Guided by the target of searching for G*, current explainers mainly leverage feature attribution and selection to extract the subgraph (Wang et al., 2021).
Feature attribution. Current explainers mainly perform feature attribution by leveraging:
• Gradient-like signals w.r.t. the graph structure (Baldassarre & Azizpour, 2019; Pope et al., 2019). For example, SA (Baldassarre & Azizpour, 2019) directly takes the gradients of the GNN's loss w.r.t. the adjacency matrix as the importance scores of edges;
• Attention scores of structural features (Luo et al., 2020; Miao et al., 2022). For example, GSAT (in its post-hoc working mode) (Miao et al., 2022) trains a parameterized predictor to generate a stochastic attention score for each edge as its importance;
• Mask scores of structural features (Ying et al., 2019; Wang et al., 2021). For example, GNNExplainer (Ying et al., 2019) adds soft masks to the input features and trains them by maximizing the mutual information between the masked outcome and the target prediction;
• Prediction changes on structure perturbations (Yuan et al., 2021; Lin et al., 2021).
For example, PGMExplainer (Vu & Thai, 2020) collects the prediction changes under random node perturbations and learns a Bayesian network from these observations.
Feature selection. Given the attribution scores, input features are sequentially filled into the set of salient features to generate the explanatory subgraph according to their importance rank. Many regularization terms are introduced to guide this process. For example, sparsity constraints (Ying et al., 2019; Schlichtkrull et al., 2021) typically leverage the l1 norm to guarantee that the selected subgraph remains within a prescribed size; connectivity constraints (Luo et al., 2020; Wang et al., 2021) give higher selection probabilities to the edges connecting with the already-selected part; more recently, information bottleneck constraints (Miao et al., 2022) were proposed to squeeze the mutual information between the input graph and the selected subgraph.
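As a minimal illustration of the selection step described above (a sketch only; the edge scores, graph, and sparsity level are hypothetical, not taken from any cited explainer), ranking edges by their attribution scores and keeping the top fraction yields the explanatory subgraph:

```python
def select_subgraph(edge_scores, sparsity):
    """Keep the top (1 - sparsity) fraction of edges by attribution score."""
    keep = max(1, round((1 - sparsity) * len(edge_scores)))
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    return set(ranked[:keep])

# Hypothetical attribution scores for the five edges of a toy cycle graph.
scores = {("a", "b"): 0.90, ("b", "c"): 0.70, ("c", "d"): 0.20,
          ("d", "e"): 0.10, ("e", "a"): 0.05}
explanation = select_subgraph(scores, sparsity=0.6)  # keep the top 2 edges
```

Regularization terms shape this pipeline indirectly: they steer the mask values that produce `edge_scores` rather than the ranking step itself.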

3. RETHINKING THE ROLE OF REGULARIZATION

In this section, we rethink the role of regularization in GNNs explainers. We start with a new form of the graph information bottleneck tailored for explainability and the formulation of feature attribution and selection (Section 3.1). Guided by this theory, we analyze the effect of regularization in the two phases respectively (Section 3.2).
Figure 1: The role of regularization in the phases of feature attribution and feature selection.

3.1. GRAPH INFORMATION BOTTLENECK FOR EXPLAINABILITY

The principle of GIB is widely leveraged by prevailing GNNs explainers to guide subgraph generation (Wu et al., 2020; Miao et al., 2022). The general formulation of GIB is as follows:
• Definition 1 (GIB (Yu et al., 2021)) Given an input graph G and its label Y, GIB seeks a maximally informative yet sparse subgraph by optimizing the following objective:
arg max_{G_s} I(G_s; Y) − βI(G_s; G), s.t. G_s ∈ G_sub(G), (1)
where G_sub(G) denotes the set of all subgraphs of G and β is the Lagrangian multiplier.
To instantiate the objective of GIB, considering the discreteness and non-differentiability of the subgraph G_s, existing explanation methods typically take G ⊙ M as a proxy of G_s, where M is the explanatory mask sharing the same size as A. Concretely, for mask-based methods, M is the trainable mask; for attention-based methods, M is the dualization of the attention matrix; and for perturbation-based methods, M is the matrix recording whether the corresponding features are perturbed or not. In light of this, we first replace G_s in Equation 1 with G ⊙ M. Moreover, according to the invariance of mutual information (MI) to invertible transformations, we rewrite Equation 1 to introduce the new GIB form tailored for GNNs explainability. The detailed derivation is provided in Appendix A.1.
• Definition 2 (GIBE) Given an input graph G and its label Y, GIBE seeks the explanatory mask M generating the explanation G ⊙ M by optimizing the following objective:
arg max_M I(G ⊙ M; Y) + α[I(G ⊙ M; G*) − I(G ⊙ M; G)], (2)
where the first term corresponds to feature attribution, the bracketed term corresponds to feature selection, and α is the tradeoff parameter, equal to β/(1 − β) for the β in Definition 1.
Theoretically, employing the Data Processing Inequality (DPI) (Cover & Thomas, 2006) along the Markov chain G* → G → Y, the optimal solution M of Equation 2 can be proved to equal the adjacency matrix of G*. We provide the detailed derivation in Appendix A.2.
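Since the derivation above leans on the Data Processing Inequality, a small numerical check may help build intuition for the MI quantities in Equations 1 and 2 (a sketch with hypothetical binary variables and noise levels standing in for G*, G, and Y):

```python
import math

def mutual_info(joint):
    """I(X;Y) in nats from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy Markov chain G* -> G -> Y: G is a noisy copy of G*, Y a noisy copy of G.
p_gstar = {0: 0.5, 1: 0.5}
flip1, flip2 = 0.1, 0.2  # hypothetical channel noise levels
joint_gstar_g = {(a, b): p_gstar[a] * (1 - flip1 if a == b else flip1)
                 for a in (0, 1) for b in (0, 1)}
# Joint of G* and Y obtained by composing both noisy channels.
joint_gstar_y = {}
for (a, b), p in joint_gstar_g.items():
    for c in (0, 1):
        q = p * (1 - flip2 if b == c else flip2)
        joint_gstar_y[(a, c)] = joint_gstar_y.get((a, c), 0.0) + q

# Data Processing Inequality along the chain: I(G*; Y) <= I(G*; G).
assert mutual_info(joint_gstar_y) <= mutual_info(joint_gstar_g)
```

The same `mutual_info` helper makes each term of the GIBE objective a concrete number once the relevant joint distributions are specified.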
Note that GIBE is the first attempt to deeply combine the principle of GIB with the common paradigm of post-hoc explainability (i.e., feature attribution and selection). Specifically, the first term of Equation 2 is the optimization objective of feature attribution, which maps the information of feature importance in G to the mask, G → G ⊙ M; and the second term is the objective of feature selection, which maps the information in the mask to the subgraph, G ⊙ M → G_s. In conclusion, GIBE endows the GIB with a better ability to directly guide the construction of explanation methods.
Regularization in feature attribution. The objective of feature attribution can be decomposed as:
I(G ⊙ M; Y) = E_{G⊙M,Y}[log P_θ(Y | G ⊙ M)] + H(Y) + E_{G⊙M}[KL(P(Y | G ⊙ M) ∥ P_θ(Y | G ⊙ M))]. (3)
The detailed derivation is provided in Appendix A.3. Treating the classifier f′ as a proxy of P_θ, the first term of Equation 3 is specified as the cross-entropy between Y and f′(G ⊙ M). In this case, since the second term H(Y) is a constant, the gap between I(G ⊙ M; Y) and the above cross-entropy, shown as gap1 in Figure 1, solely depends on the third term of Equation 3. Note that this term is the Kullback-Leibler (KL) divergence between P(Y | G ⊙ M) and f′(G ⊙ M). To bridge this gap and maximize the effectiveness of feature attribution, regularization should play a role in constraining P(Y | G ⊙ M) toward f′. Since f′ is trained on the data distribution P(G, Y), this role reduces to constraining G ⊙ M toward G. In this case, the constraint orientation of M should be loosened toward A to squeeze the gap in feature attribution to 0, as shown on the left side of Figure 1.
Regularization in feature selection. Feature selection typically binarizes M to achieve the mapping G ⊙ M → G_s.
To bridge the gap between M and its binarization M′, shown as gap2 in Figure 1, the optimization objective of feature selection, I(G ⊙ M; G*) − I(G ⊙ M; G) in Equation 2, is usually maximized by: (1) discreteness constraints (Ying et al., 2019) and connectivity constraints (Luo et al., 2020), which work toward maximizing I(G ⊙ M; G*); (2) sparsity constraints (Schlichtkrull et al., 2021), which leverage the l1 or l2 norm to minimize I(G ⊙ M; G). That is, in the phase of feature selection, regularization plays a role in constraining G ⊙ M toward G_s. In other words, the constraint orientation of M should be tightened toward A_s (i.e., the adjacency matrix of G_s) to squeeze the gap in feature selection to 0, as shown on the right side of Figure 1.
In conclusion, while a constraint orientation of M that benefits feature attribution drives the gap in feature attribution toward zero, the gap in feature selection will inevitably become large, and vice versa. In other words, since the role of regularization differs completely across the two phases, regularization is essentially a tradeoff that guarantees the mapping effectiveness in both phases:
• Proposition 1 (Essence of Regularization) Regularization in GNNs explainability is essentially a tradeoff scheme that pursues a balance between the phases of feature attribution and selection.
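The decomposition in Equation 3 can also be checked numerically. The following sketch uses a hypothetical joint distribution over a masked input and a binary label, and a hypothetical variational predictor `q` standing in for P_θ; it verifies that the cross-entropy term, H(Y), and the KL term reassemble the mutual information exactly:

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a distribution {value: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical joint P(X, Y): X plays the role of G (.) M, Y the binary label.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Hypothetical variational predictor q(y | x), standing in for P_theta.
q = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

mi = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())
e_log_q = sum(p * math.log(q[x][y]) for (x, y), p in joint.items())
kl = sum(px[x] * sum((joint[(x, y)] / px[x])
                     * math.log((joint[(x, y)] / px[x]) / q[x][y])
                     for y in (0, 1))
         for x in (0, 1))

# The three terms of Equation 3 reassemble the mutual information exactly.
assert abs(mi - (e_log_q + entropy(py) + kl)) < 1e-12
```

The better `q` matches the true conditional, the smaller the KL term, which is exactly the gap1 that regularization tries to shrink in feature attribution.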

4. PROPOSITIONS OF REGULARIZATION IN EXPLANATION METHODS

Figure 2: The rationale of the sparsity-adaptive regularization scheme.
In this section, we derive three intriguing propositions stemming from GIBE and Proposition 1, which respectively reveal the relation between regularization and three important concepts in GNNs explainability (i.e., sparsity (Ying et al., 2019; Lucic et al., 2022), the stochastic mechanism (Luo et al., 2020; Wang et al., 2021), and the OOD issue (Miao et al., 2022; Wu et al., 2022)). Inspired by these propositions, we propose two simple yet effective regularization schemes.

4.1. REGULARIZATION & SPARSITY

Let η ∈ (0, 1) denote the sparsity of the explanation, i.e., the fraction of edges excluded from the explanatory subgraph. Recall that the objective of feature selection is I(G ⊙ M; G*) − I(G ⊙ M; G). We aim to derive the relation between η and the phase of feature selection in current explanation methods. The derivation proceeds as follows. To calculate this objective, we first introduce the variational approximation P_ϕ(G ⊙ M | G*) for P(G ⊙ M | G*) and Q for the marginal distribution P(G ⊙ M). Then the lower bound of the objective can be derived as:
I(G ⊙ M; G*) − I(G ⊙ M; G) ≥ E_{G⊙M,G*}[log P_ϕ(G ⊙ M | G*)] + H(G ⊙ M) − E_{G⊙M,G}[KL(P(G ⊙ M | G) ∥ Q(G ⊙ M))]. (4)
The detailed derivation is shown in Appendix A.4. Since M is inherited from feature attribution, the above lower bound is in positive relation to its first term, E_{G⊙M,G*}[log P_ϕ(G ⊙ M | G*)], as the remaining terms are constants. To calculate this first term, the variational distribution P_ϕ(G ⊙ M | G*) is defined as follows: for every directed node pair (u, v) in G*, we sample the elements of M by m_uv ∼ Bern(z), where z ∈ [0, 1] is a hyperparameter. With the above P_ϕ, the first term of Equation 4 equals:
E_{G⊙M,G*}[log P_ϕ(G ⊙ M | G*)] = Σ_{(u,v)∈E*} [m_uv log(m_uv / z) + (1 − m_uv) log((1 − m_uv) / (1 − z))], (5)
where E* is the set of edges in G*. Note that the sum in Equation 5 is in positive relation to the number of terms |E*|, which equals (1 − η)|G|, since M is inherited and the value of every term is fixed.
That is, in the phase of feature selection, the lower bound of the objective is in negative relation to η. It is worth mentioning that Equation 5 is similar to the regularization term proposed in GSAT (Miao et al., 2022); however, our form applies to post-hoc explanation methods in general, while the term in GSAT is tailored to GSAT alone. Therefore, for an interpreting task that calls for low sparsity, the magnitude of the objective in feature selection is large. Since regularization is the tradeoff scheme between the two phases (Proposition 1), it should be loosened in a direction favorable to feature attribution to keep the balance, and vice versa. We illustrate this process in Figure 2. More formally:
• Proposition 2 (Sparsity-adaptive Regularization) For a certain interpreting task, let K_i and K_j be the optimal coefficient vectors of regularization under the predefined explanatory sparsities η_i and η_j. For all η_i, η_j ∈ (0, 1) we have: η_i ≥ η_j ⇔ K_i ≥ K_j. (6)
According to Proposition 2, we propose a simple yet effective scheme called the Sparsity-adaptive Regularization Scheme (SRS) to enhance the performance of existing explainers. Concretely, SRS first searches for the optimal coefficients K_i under a certain sparsity η_i by grid search; then, for another sparsity η_j, SRS sets K_j = (η_j / η_i) K_i. The specific implementation and experimental verification are provided in Section 5.2.
We now turn to the relationship between regularization and the stochastic mechanism. Recent years have witnessed a surge in research that leverages the stochastic mechanism to enhance the performance of GNNs explainability (Luo et al., 2020; Wang et al., 2021; Miao et al., 2022). Concretely, the probability of a graph G is first factorized as P(G) = Π_{(i,j)∈E} P(e_ij), where e_ij = 1 if the edge (i, j) exists, and 0 otherwise.
Then the Bernoulli distribution is employed to instantiate P(e_ij), where the stochastic mechanism is introduced in this process (Luo et al., 2020). Meanwhile, the Gumbel-Softmax reparameterization trick is applied to keep gradients tractable through the Bernoulli sampling (Jang et al., 2017). However, despite the satisfactory performance of the stochastic mechanism, current research has largely neglected its theoretical basis.
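The edge-wise relaxed Bernoulli sampling described above can be sketched in a few lines (a minimal binary-Concrete / Gumbel-sigmoid sketch, not the exact implementation of any cited explainer; the logit value and temperature are hypothetical):

```python
import math
import random

def gumbel_sigmoid(logit, temperature=0.5):
    """Relaxed Bernoulli (binary Concrete) sample for one edge-mask entry."""
    u = random.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard the log terms
    logistic_noise = math.log(u) - math.log(1.0 - u)  # difference of two Gumbels
    return 1.0 / (1.0 + math.exp(-(logit + logistic_noise) / temperature))

random.seed(0)
# An edge with logit 2.0 exceeds 0.5 with probability sigmoid(2.0) ~ 0.88,
# but each relaxed sample stays in (0, 1) and is differentiable w.r.t. the logit.
samples = [gumbel_sigmoid(2.0) for _ in range(2000)]
```

Lowering the temperature pushes the relaxed samples toward {0, 1}, recovering a near-discrete mask while keeping the gradient path intact.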

4.2. REGULARIZATION & STOCHASTIC MECHANISM

We argue that the stochastic mechanism bolsters explanation methods by facilitating feature selection. Specifically, applying the stochastic mechanism to the training stage of M can be regarded as adding random noise to M (Shwartz-Ziv & Tishby, 2017). The entropy of the mask's distribution keeps increasing during this process under the constraints of the information bottleneck, which in turn maximizes the conditional entropy H(G | G ⊙ M). Since the entropy of the input graph H(G) is a constant, the mutual information I(G ⊙ M; G) = H(G) − H(G | G ⊙ M) is minimized thanks to the stochastic mechanism. In conclusion, the stochastic mechanism accelerates the process of minimizing I(G ⊙ M; G), thus promoting the compression of the explanatory subgraph. That is, it can be regarded as an implicit regularization term in favor of the phase of feature selection. We summarize this rationale as Proposition 3 and verify it in Section 5.3.
• Proposition 3 (Rationale of the Stochastic Mechanism) The stochastic mechanism works as an implicit regularization term, which endows GNNs explainability with better compressibility.
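A toy computation illustrates the entropy argument above (a sketch; the mask value and noise levels are hypothetical): blending a confident mask entry with uniform noise monotonically raises its Bernoulli entropy, and for a fixed H(G) a higher mask entropy means a lower I(G ⊙ M; G).

```python
import math

def bernoulli_entropy(p):
    """Entropy (nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

# A confident mask entry m = 0.95, progressively mixed with uniform noise:
# m' = (1 - eps) * m + eps * 0.5 drifts toward the maximum-entropy value 0.5.
m = 0.95
entropies = [bernoulli_entropy((1 - eps) * m + eps * 0.5)
             for eps in (0.0, 0.3, 0.6, 0.9)]
# Entropy increases monotonically with the injected noise level.
assert all(a < b for a, b in zip(entropies, entropies[1:]))
```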

4.3. REGULARIZATION & OOD

Many recent endeavors have been made toward revealing the out-of-distribution issue of post-hoc explanation methods (Chang et al., 2019; Qiu et al., 2022; Wu et al., 2022; Miao et al., 2022). Concretely, the OOD issue arises in the data space because the distribution of full graphs P(G) differs from that of subgraphs P(G_s) w.r.t. some properties of graph data, such as size (Bevilacqua et al., 2021), degree (Tang et al., 2020), and homophily (Lei et al., 2022). It is thus fallacious to leverage the original GNN f′, which is trained on full graphs, to estimate the MI in terms of subgraphs. Unfortunately, most previous explainers (Ying et al., 2019; Luo et al., 2020; Vu & Thai, 2020; Miao et al., 2022) simply feed G ⊙ M into f′, unaware that I(f(G ⊙ M); Y) is not proportional to I(G ⊙ M; Y), sometimes not even close. To reveal the relationship between the OOD issue and regularization, we scrutinize the OOD between P(G) and P(G_s) and decouple it into two parts: (1) OOD-AT, between P(G) and P(G ⊙ M), posed in feature attribution; and (2) OOD-SE, between P(G ⊙ M) and P(G_s), posed in feature selection.
OOD-AT. We first focus on OOD-AT, posed in feature attribution. According to Equation 3, feature attribution searches for the optimal M by calculating the cross-entropy H(f(G ⊙ M), Y) to approximate I(G ⊙ M; Y). The approximation error, illustrated as gap1 in Figure 1, is given as the sum of the second and third terms of Equation 3. The second term, H(Y), is constant across different candidates M and thus does not affect the search results. Unfortunately, for the third term, KL(P(Y | G ⊙ M) ∥ P_θ(Y | G ⊙ M)), OOD-AT keeps this part far from zero and makes it inevitably fluctuate across different M. This fluctuation degrades the fairness of the search for the optimal M, and further degenerates the effectiveness of the mapping G → G ⊙ M in feature attribution.
Thus we formulate the tangible impact caused by OOD-AT as:
OOD-AT ∝ D_M[KL(P(Y | G ⊙ M) ∥ P_θ(Y | G ⊙ M))], (7)
where D denotes the variance over M. If regularization pushes P(G ⊙ M) close to P(G) by constraining M toward A, Equation 7 will approach zero and OOD-AT will cause little degradation. Note that this process mirrors the left side of Figure 1: the constraint orientation is loosened toward A to eliminate the gap.

OOD-SE.

We then focus on OOD-SE, posed in feature selection. Since feature selection binarizes M to get M′ and generates G_s = G ⊙ M′, the degree of OOD-SE can be simplified as the distance between M and A_s. That is, the OOD-SE issue will be remedied if regularization pushes P(G ⊙ M) close to P(G_s) by constraining M toward A_s. In conclusion, just like the dilemma of regularization between the two phases described in Section 3.2, the optimal regularization schemes for remedying OOD-AT and OOD-SE are completely different. That is, for a certain interpreting task, there is an inherent conflict between the OOD in feature attribution and the OOD in feature selection. Note that the above analysis also explains why the OOD issue is an inherent limitation of post-hoc explainability: since we can only remedy one side while neglecting the other, OOD cannot be thoroughly settled by adjusting the training objective and hyperparameters alone. We formulate this proposition as:
• Proposition 4 (Rationale of OOD) For a certain interpreting task, the OOD issue is attributed to the contradictory effects of regularization on the phases of feature attribution and selection.
Even though the OOD issue is inherent, we introduce a simple yet effective regularization scheme called the OOD-resistant Regularization Scheme (ORS) to alleviate the above dilemma to some extent. Specifically, in the early stage of training, loose regularization of M is performed in favor of alleviating OOD-AT for a more precise mapping G → G ⊙ M; then, we gradually tighten the regularization of M to alleviate OOD-SE for the mapping G ⊙ M → G_s. ORS achieves better explanation performance than the common invariant regularization scheme. The specific implementation and experimental verification are provided in Section 5.3.
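The loosen-then-tighten idea behind ORS can be sketched as a coefficient schedule (an illustrative sketch only; the warm-up fraction, the linear ramp, and the function name are hypothetical rather than the paper's exact implementation):

```python
def ors_coefficient(epoch, total_epochs, k_final, warmup_frac=0.3):
    """Keep regularization off early (favoring feature attribution and
    alleviating OOD-AT), then ramp it linearly to k_final (tightening the
    constraint toward A_s and alleviating OOD-SE)."""
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return 0.0
    return k_final * (epoch - warmup) / (total_epochs - warmup)

# Coefficient over a hypothetical 100-epoch run with final strength 1.0.
schedule = [ors_coefficient(e, 100, 1.0) for e in (0, 20, 30, 65, 100)]
```

Any monotone ramp (cosine, exponential) would express the same attribution-first, selection-later ordering; the linear form is only the simplest choice.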

5. EXPERIMENT

In this section, we conduct extensive experiments to answer the following research questions:
• RQ1: Do Proposition 1 and Proposition 2 hold in practice, and can the proposed SRS boost the performance of current explainers?
• RQ2: Do Proposition 3 and Proposition 4 hold in practice, and can the proposed ORS alleviate the OOD issue?

5.1. EXPERIMENTAL SETTINGS

Datasets and classifiers.
• BA-3motifs (Ying et al., 2019) contains 3,000 graphs, each attached with one of the motif types house, cycle, or grid, for which a trained ASAP (Ranjan et al., 2020) has achieved a 99.3% testing accuracy.
• MUTAG (Kazius et al., 2005) contains 4,337 molecule graphs categorized into two classes based on their mutagenic effect on the Gram-negative bacterium. A well-trained Graph Isomorphism Network (GIN) (Xu et al., 2019) has achieved a 97.7% testing accuracy.
• MNIST (Monti et al., 2017) superpixel converts 70,000 images into graphs of superpixel adjacency. A trained spline-based GNN (Fey et al., 2018) has achieved a 94.6% testing accuracy.
Evaluation metrics and baselines. We select three commonly used metrics to evaluate our results: Accuracy, Precision, and Fidelity. Moreover, we leverage six state-of-the-art methods to verify the proposed propositions and schemes: GNNExplainer (Ying et al., 2019), PGExplainer (Luo et al., 2020), GraphMask (Schlichtkrull et al., 2021), CF-GNNExplainer (Lucic et al., 2022), ReFine (Wang et al., 2021), and GSAT (Miao et al., 2022). More details are provided in Appendix B.

5.2. EVALUATION OF PROPOSITION 1, PROPOSITION 2 AND SRS (RQ1)

We first verify Proposition 1 (Section 3.2) and Proposition 2 (Section 4.1) simultaneously. Since the role of regularization is to keep the balance between the two phases (Proposition 1), varying its coefficient K will break the balance and thereby impose opposite impacts on the performance under different sparsities η (Proposition 2). In light of this, we start with the optimal regularization coefficients for ACC-AUC found by grid search, then deliberately increase K and record the accuracy under different η. The results are shown in Figure 4, where the horizontal axis represents the multiple by which K is increased and the marker length shows the variance. Note that the performance of the other baselines shows similar trends to the results in Figure 4.
Table 1: The performance of baseline explainers averaged across 10 runs. The best-performing methods are bold with blue lines, and the strongest baselines are underlined.
According to Figure 4, we have the following observations:
• When K is changed, the fluctuation of the overall performance (i.e., ACC-AUC) is not too strong (±7.25%) on average. However, the fluctuation of ACC at high and low sparsity is obviously larger (±21.86%) than that of ACC-AUC. These observations verify the role of regularization: a tradeoff that keeps the balance and guarantees the overall performance at a high level.
• ACC under low sparsity and ACC-AUC decrease as the regularization coefficient increases. On the contrary, ACC at high sparsity increases by a large margin (29.63% ↑) on average, up to a maximum of 67.83% in MUTAG, which appears counterintuitive at first glance.
However, these observations exactly conform to the theoretical analysis in Proposition 2: a loose constraint (i.e., large K) benefits the performance at high sparsity, while a tight constraint (i.e., small K) favors the performance under low sparsity.
With the above observations verifying Proposition 1 and Proposition 2, we now evaluate the effectiveness of the derived scheme, SRS. Specifically, for a certain interpreting task, we first search for the optimal K under η = 0.5 by grid search, then reduce or enlarge K proportionally to the variation of η in (0.1, ..., 0.9). The results are summarized in Table 1. We have the following observations:
• The baseline explainers enhanced by SRS outperform their vanilla counterparts in all cases. More specifically, SRS achieves significant improvements over the six baselines w.r.t. fidelity, by 7.9% and 7.2% on average in MUTAG and BA3-motif, respectively. The improvement reaches a maximum of 12.8% when GNNExplainer is employed to interpret graphs in MUTAG. This demonstrates the effectiveness and universality of SRS, and verifies that SRS can be leveraged to boost the accuracy of current explainers. We attribute these improvements to SRS's ability to allocate the most adaptive regularization coefficient for each explanatory sparsity.
• SRS provides much more stable explanations than the baselines, as shown by the much smaller variance. More specifically, the STD of SRS is lower than that of the baselines by a large margin (22.8% ↓) on average. This phenomenon verifies that the performance of most explainers is unstable because it depends on the matching degree between the task and the regularization terms, while SRS avoids this dilemma by adaptively allocating regularization.
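The SRS rule used above can be sketched in a few lines (an illustrative sketch; the coefficient values are hypothetical placeholders for a grid-searched vector):

```python
def srs_coefficients(k_base, eta_base, eta):
    """Proposition 2 / SRS rule: K_j = (eta_j / eta_i) * K_i, i.e. scale the
    grid-searched coefficient vector proportionally to the target sparsity."""
    return [(eta / eta_base) * k for k in k_base]

# Hypothetical coefficient vector tuned once at sparsity 0.5 by grid search,
# then reused across the sparsity sweep (0.1, ..., 0.9) without re-tuning.
k_half = [0.10, 0.05]
sweep = {eta / 10: srs_coefficients(k_half, 0.5, eta / 10)
         for eta in range(1, 10)}
```

The appeal of the rule is that one grid search amortizes over the whole sparsity sweep, instead of re-tuning the coefficients at every η.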

5.3. EVALUATION OF PROPOSITION 3, PROPOSITION 4 AND ORS (RQ2)

We now verify Proposition 3 (Section 4.2) and Proposition 4 (Section 4.3). According to Proposition 3, the stochastic mechanism endows GNNs explainability with better compressibility, which has a rationale similar to that of the regularization terms. In light of this, we gradually increase the magnitude of stochasticity in the explainers and observe whether the performance shows the same trends as the results in Figure 4. The overall results are summarized in Figure 5. We have the following observations:
• The ACC under low sparsity and the ACC-AUC decrease (8.76% ↓ and 19.85% ↓) as the magnitude of stochasticity increases, while the ACC under high sparsity increases (14.52% ↑), up to a maximum of 37.03% in MUTAG. Therefore, a large magnitude of stochasticity benefits the performance at high sparsity and vice versa, mirroring the effect of the regularization terms. In conclusion, the same trends between Figure 4 and Figure 5 verify that the stochastic mechanism has a rationale similar to that of the regularization terms, and further verify Proposition 3.
Figure 5: The performance of baseline explainers averaged across 10 runs for different sparsity while the magnitude of stochasticity is changed. Best viewed in color.
In terms of Proposition 4, since the degree of OOD is intractable, we directly leverage the derived scheme, ORS, to remedy the OOD issue in explainers. We summarize the results in Table 2. We find that:
• The explainers enhanced by ORS attain higher accuracy on the graph classification tasks than their vanilla counterparts. Specifically, ORS achieves significant improvements over the strongest baselines w.r.t. ACC-AUC, by 4.1% and 3.5% in MUTAG and BA3-motif, respectively. These improvements verify the reliability and effectiveness of ORS. We attribute them mainly to the advantage of remedying the OOD issue in the phases of feature attribution and selection.
This observation further validates the theoretical analysis of the OOD issue in the above two phases (Proposition 4) indirectly.

6. CONCLUSION

In this work, we rethink the role of regularization in GNNs explainability from the perspective of information theory. We first revisit the concept of GIB and derive a new GIB form tailored for GNNs explainability: GIBE. The role of regularization in the phases of feature attribution and selection is explored respectively under the guidance of GIBE. Moreover, inspired by these insights, four intriguing propositions and two common optimization schemes of regularization are introduced. Extensive experiments are conducted on both synthetic and real-world datasets to validate the rationality of our propositions and the superiority of our schemes. This work represents an initial attempt to dig deeper, theoretically, into regularization for explainability, providing a new perspective on understanding its role.

A.1 DERIVATION OF GIBE

The mutual information is invariant to invertible transformations:
I(X; Y) = I(ψ(X); ϕ(Y)), (8)
for any invertible functions ψ and ϕ. Since Y is determined by G* in the sense that Y = f(G*) + ϵ for some deterministic invertible function f, according to Equation 8, I(G_s; Y) in the objective of GIB (i.e., Equation 1) can be rewritten as (Miao et al., 2022):
I(G_s; Y) = I(G_s; G*). (9)
Thus we can decompose the first term in the objective of GIB into:
I(G_s; Y) = (1 − β)I(G_s; Y) + βI(G_s; G*), (10)
for any β ∈ (0, 1). Combining Equation 10 and Equation 1, the objective of GIB can be rewritten as:
arg max_{G_s} (1 − β)I(G_s; Y) + β[I(G_s; G*) − I(G_s; G)]. (11)
We then substitute the coefficient α = β/(1 − β) into the above equation:
arg max_{G_s} I(G_s; Y) + α[I(G_s; G*) − I(G_s; G)]. (12)
Taking G ⊙ M as the substitute for G_s, the new form of GIB for explainability can be derived as:
arg max_M I(G ⊙ M; Y) + α[I(G ⊙ M; G*) − I(G ⊙ M; G)], (13)
which is the objective of GIBE.

A.2 PROOF OF THE CONVERGENCE OF GIBE

Theorem A.1. For any α ∈ (0, +∞), G ⊙ M = G* maximizes the objective of GIBE: I(G ⊙ M; Y) + α[I(G ⊙ M; G*) - I(G ⊙ M; G)].

Proof. Consider the following derivation:

I(G ⊙ M; Y) + α[I(G ⊙ M; G*) - I(G ⊙ M; G)]
= I(Y; G ⊙ M, G*) - I(G*; Y | G ⊙ M) + α[I(G ⊙ M; G*) - I(G ⊙ M; G)]
= I(Y; G ⊙ M, G*) - I(G*; Y | G ⊙ M) + α[I(G*; Y, G ⊙ M) - I(G*; Y | G ⊙ M) + I(G*; Y | G ⊙ M) - I(G*; G ⊙ M, Y)]
= (1 + α)I(Y; G*) - (1 + α)I(G*; Y | G ⊙ M) + αI(G*; Y | G ⊙ M) - αI(G*; G ⊙ M, Y)
= (1 + α)I(Y; G*) - I(G*; Y | G ⊙ M) - αI(G*; G ⊙ M, Y)
= (1 + α)I(Y; G*) - I(G*; Y | G ⊙ M) - αI(Y; G*) - αI(G*; G ⊙ M | Y)
= I(Y; G*) - I(G*; Y | G ⊙ M) - αI(G*; G ⊙ M | Y).

Since I(Y; G*) is a constant, for any α ∈ (0, +∞), the G ⊙ M that maximizes the objective of GIBE also minimizes I(G*; Y | G ⊙ M) + αI(G*; G ⊙ M | Y). As I(G*; Y | G ⊙ M) ≥ 0 and I(G*; G ⊙ M | Y) ≥ 0, the lower bound of I(G*; Y | G ⊙ M) + αI(G*; G ⊙ M | Y) is 0. Moreover, G ⊙ M = G* attains this bound, i.e., I(G*; Y | G ⊙ M) + αI(G*; G ⊙ M | Y) = 0, because (1) Y = f(G*) + ϵ, where ϵ is independent of G, so that I(G*; Y | G ⊙ M) = 0, and (2) G* = f^{-1}(Y - ϵ), where ϵ is independent of G, so that I(G*; G ⊙ M | Y) = 0. Therefore, G ⊙ M = G* maximizes the objective of GIBE in Equation 2.
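As a numerical sanity check on the chain-rule identity used in the first step of this proof, I(Y; G ⊙ M, G*) = I(Y; G ⊙ M) + I(G*; Y | G ⊙ M), one can evaluate both sides on a small discrete joint distribution. The sketch below is purely illustrative (the toy distribution and all names are our own, not part of the paper's implementation):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Toy joint distribution over three binary variables,
# axes: 0 = G (.) M (the masked graph), 1 = G*, 2 = Y.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

def H(axes):
    """Entropy of the marginal over the given axes."""
    drop = tuple(a for a in range(3) if a not in axes)
    return entropy(p.sum(axis=drop))

# Chain rule: I(Y; S, G*) = I(Y; S) + I(G*; Y | S), with S = G (.) M.
I_joint = H((2,)) + H((0, 1)) - H((0, 1, 2))          # I(Y; S, G*)
I_marg = H((2,)) + H((0,)) - H((0, 2))                # I(Y; S)
I_cond = H((0, 1)) + H((0, 2)) - H((0,)) - H((0, 1, 2))  # I(G*; Y | S)
assert abs(I_joint - (I_marg + I_cond)) < 1e-12
```

The identity holds exactly for any joint distribution, which is why the assertion passes for an arbitrary random joint.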

A.3 ESTIMATION OF MUTUAL INFORMATION IN FEATURE ATTRIBUTION

According to the definition of mutual information:

I(G ⊙ M; Y) = E_{G⊙M, Y}[log (P(Y | G ⊙ M) / P(Y))].  (15)

Since P(Y | G ⊙ M) is intractable, we introduce a variational approximation P_θ(Y | G ⊙ M) for it. Then, we can rewrite Equation 15 as:

I(G ⊙ M; Y) = E_{G⊙M, Y}[log (P_θ(Y | G ⊙ M) / P(Y))] + E_{G⊙M}[KL(P(Y | G ⊙ M) ∥ P_θ(Y | G ⊙ M))]
= E_{G⊙M, Y}[log P_θ(Y | G ⊙ M)] + H(Y) + E_{G⊙M}[KL(P(Y | G ⊙ M) ∥ P_θ(Y | G ⊙ M))].
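Dropping the non-negative KL term above yields the familiar lower bound I(G ⊙ M; Y) ≥ E[log P_θ(Y | G ⊙ M)] + H(Y), which in practice reduces to minimizing a cross-entropy loss. A minimal sketch of estimating this bound from model outputs, with illustrative names (not the paper's code):

```python
import numpy as np

def mi_lower_bound(probs, labels):
    """Variational lower bound on I(G (.) M; Y):
    E[log P_theta(Y | G (.) M)] + H(Y), obtained by dropping the
    non-negative KL term in the decomposition above.
    probs:  (n, c) array of predicted class probabilities P_theta(y|x).
    labels: (n,) integer array of observed labels y."""
    n, c = probs.shape
    log_lik = np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    p_y = np.bincount(labels, minlength=c) / n       # empirical P(Y)
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))  # entropy H(Y)
    return log_lik + h_y
```

For a perfect predictor on balanced binary labels the bound approaches H(Y) = log 2; for an uninformative uniform predictor it approaches 0.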

A.4 VARIATIONAL BOUNDS FOR MUTUAL INFORMATION IN FEATURE SELECTION

We now focus on the lower bound for the mutual information in the phase of feature selection:

I(G ⊙ M; G*) - I(G ⊙ M; G).  (17)

For the first term I(G ⊙ M; G*), by definition:

I(G ⊙ M; G*) = E_{G⊙M, G*}[log (P(G ⊙ M | G*) / P(G ⊙ M))].  (18)

Since P(G ⊙ M | G*) is intractable, we introduce a variational approximation P_ϕ(G ⊙ M | G*) for it. Then, we obtain a lower bound for Equation 18:

I(G ⊙ M; G*) = E_{G⊙M, G*}[log (P_ϕ(G ⊙ M | G*) / P(G ⊙ M))] + E_{G⊙M, G*}[KL(P(G ⊙ M | G*) ∥ P_ϕ(G ⊙ M | G*))]
≥ E_{G⊙M, G*}[log (P_ϕ(G ⊙ M | G*) / P(G ⊙ M))]
= E_{G⊙M, G*}[log P_ϕ(G ⊙ M | G*)] + H(G ⊙ M).  (19)

Under review as a conference paper at ICLR 2023

Then, for the second term I(G ⊙ M; G), we have:

I(G ⊙ M; G) = E_{G⊙M, G}[log (P(G ⊙ M | G) / P(G ⊙ M))].  (20)

Since P(G ⊙ M) is intractable, we introduce a variational approximation Q for it. Then, we obtain an upper bound for Equation 20:

I(G ⊙ M; G) = E_{G⊙M, G}[log (P(G ⊙ M | G) / Q(G ⊙ M))] - E_{G⊙M, G}[KL(P(G ⊙ M) ∥ Q(G ⊙ M))]
≤ E_{G⊙M, G}[log (P(G ⊙ M | G) / Q(G ⊙ M))]
= E_{G⊙M, G}[KL(P(G ⊙ M | G) ∥ Q(G ⊙ M))].  (21)

Plugging Equation 19 and Equation 21 into Equation 17, we obtain a variational lower bound of Equation 17 as the objective of feature selection:

I(G ⊙ M; G*) - I(G ⊙ M; G) ≥ E_{G⊙M, G*}[log P_ϕ(G ⊙ M | G*)] + H(G ⊙ M) - E_{G⊙M, G}[KL(P(G ⊙ M | G) ∥ Q(G ⊙ M))].
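As one concrete instantiation (following GSAT-style stochastic attention; the factorized Bernoulli(r) prior is an assumption we add for illustration, not something prescribed by the derivation above), when the mask entries are independent Bernoulli variables, the KL upper bound in Equation 21 decomposes into a sum over edges:

```python
import numpy as np

def bernoulli_kl(p, r):
    """KL(Bern(p) || Bern(r)), elementwise, in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return p * np.log(p / r) + (1 - p) * np.log((1 - p) / (1 - r))

def selection_regularizer(edge_probs, r=0.5):
    """Upper bound on I(G (.) M; G) when the mask entries are independent
    Bernoulli(edge_probs) and the variational prior Q is a factorized
    Bernoulli(r): the KL term decomposes into a per-edge sum."""
    return float(np.sum(bernoulli_kl(np.asarray(edge_probs), r)))
```

The bound vanishes exactly when every edge probability matches the prior r, and grows as the learned mask becomes more deterministic, which is the compression pressure discussed in the main text.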

B EXPERIMENT SETTING

Datasets and Target GNNs. We use one synthetic dataset and two real-world datasets, all of which are publicly accessible. Three popular GNN models are trained to perform graph classification. The statistics of the datasets and the configurations of the GNN models are summarized in Table 3. Note that some benchmark datasets may not satisfy the assumption in Section 2; for further exploration, we still take them into consideration.

• MUTAG (Kazius et al., 2005; Riesen & Bunke, 2008) contains 4,337 molecule graphs categorized into two classes based on their mutagenic effect on the Gram-negative bacterium.
• BA-3motifs (Ying et al., 2019; Luo et al., 2020) contains 3,000 graphs attached with one of three motif types: house, cycle, and grid, where Barabasi-Albert (BA) graphs are adopted as the base.
• MNIST (Monti et al., 2017; Deng, 2012) superpixel dataset converts 70,000 images into graphs of superpixel adjacency, where every graph is labeled as one of ten digit classes.
• GIN (Xu et al., 2019) surpasses popular GNN variants such as Graph Convolutional Networks and GraphSAGE in terms of expressive power, as it generalizes the Weisfeiler-Lehman graph isomorphism test and hence achieves maximum discriminative power under the neighborhood aggregation framework.
• ASAP (Ranjan et al., 2020) utilizes a self-attention network along with a modified GNN formulation to capture the importance of each node in a given graph, and learns a sparse soft cluster assignment for the nodes at each layer to effectively pool the subgraphs into the pooled graph.
• Spline-based GNN (Fey et al., 2018) adopts a novel convolution operator based on B-splines, which operates in the spatial domain and aggregates local features with a continuous kernel function parameterized by trainable B-spline control values, allowing very fast training and inference.

Evaluation Metrics. It is of crucial importance to evaluate the explanations quantitatively, since human evaluations highly depend on subjective understanding. Prior studies have proposed metrics to quantitatively assess explanations (Yuan et al., 2020; Dwivedi et al., 2020), among which we select three commonly used metrics to evaluate our results.
For clarity, we denote G_s^K as the explanatory subgraph obtained by taking the top-K edges in G, and |G| as the number of edges in graph G.

• Predictive Accuracy (ACC@η) (Chen et al., 2018). This metric feeds the explanatory subgraph into the target model and measures the quality of the explanation by auditing how well it recovers the target prediction, where η is the predefined sparsity. Formally, given the trained GNN model f, we have

ACC@η = E_G[I(f(G), f(G_s^{⌈(1-η)×|G|⌉}))],

where I(•, •) is the indicator function that takes value 1 when its two arguments are equal and 0 otherwise. Moreover, we plot the curve of ACC over sparsity η ∈ {0, 0.1, ..., 0.9} on the test set and denote ACC-AUC as the area under the ACC curve. Note that ACC@η and ACC-AUC do not rely on ground-truth labels and thus are suitable for all the datasets.

• Precision@N (Ying et al., 2019). This metric measures the consistency between the explanatory subgraph G_s and the ground-truth subgraph G*. Concretely, the edges within G* are positive in G, while the remainder are negative. In this case, precision can be adopted as the evaluation protocol. More formally,

Precision@N = E_G[|G_s^N ∩ G*| / |G*|].

• Fidelity@p (Yuan et al., 2020). The Fidelity metric studies the prediction change caused by removing the important input features identified by the explanation method. Formally,

Fidelity@p = E_G[f(G)_Y - f(G \ G_s^{⌈p×|G|⌉})_Y].

Baselines. We leverage the state-of-the-art methods to verify the proposed propositions and optimization schemes, covering the following:

• GNNExplainer (Ying et al., 2019) directly learns an adjacency-matrix mask by maximizing the mutual information between a GNN's prediction and the distribution of possible subgraph structures, which is performed by multiplying the mask with the input features.
• PGExplainer (Luo et al., 2020) adopts a deep neural network to parameterize the generation process of explanations, which makes it a natural approach to explaining multiple instances collectively. It can also provide global explanations for a certain class.
• GraphMask (Schlichtkrull et al., 2021) learns a simple classifier that, for every edge in every layer, predicts whether that edge can be dropped, in a fully differentiable fashion. By dropping edges without deteriorating the performance of the model, the remaining edges naturally form an explanation for the model's prediction.
• CF-GNNExplainer (Lucic et al., 2022) focuses on counterfactual explanations by figuring out the minimal perturbation to the input (graph) data such that the prediction changes. By instantiating the perturbation with only edge deletions, it finds the edges that are crucial for the original predictions.
• Refine (Wang et al., 2021) develops an explainer that can generate multi-grained explanations by exploiting the pre-training and fine-tuning idea. Specifically, the pre-training phase exhibits global explanations with prototypical patterns, and the fine-tuning phase further adapts the global explanations to the local context with high fidelity.
• GSAT (Miao et al., 2022) leverages stochastic attention to block the information from the task-irrelevant graph components while learning stochasticity-reduced attention to select the task-relevant subgraphs for interpretation. Though it is naturally an inherently interpretable method, it also works in a post-hoc way through fine-tuning. We adopt this post-hoc working mode as one baseline.

Training Optimization and Early Stopping. All experiments are done on a single Tesla V100 SXM2 GPU (32 GB). During training, we use the Adam (Kingma & Ba, 2015) optimizer. The maximum number of epochs is 200 for all datasets. We use Stochastic Gradient Descent (SGD) for the optimization of all GNN models.
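The evaluation metrics described above can be sketched in a few lines. The function names are illustrative, and approximating ACC-AUC by the mean ACC over the evenly spaced sparsity levels is our simplification:

```python
import numpy as np

def acc_at_eta(full_preds, sub_preds):
    """ACC@eta: average of the indicator that the prediction on the
    explanatory subgraph matches the prediction on the full graph."""
    return float(np.mean(np.asarray(full_preds) == np.asarray(sub_preds)))

def acc_auc(acc_values):
    """Area under the ACC curve over the evenly spaced sparsity levels
    eta in {0.0, 0.1, ..., 0.9}, approximated by the mean ACC."""
    return float(np.mean(acc_values))
```

For instance, if the subgraph recovers the full-graph prediction on two of three test graphs, ACC@η = 2/3, and averaging such values over the ten sparsity levels gives the ACC-AUC reported in the tables.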
The initial learning rate is set to 10^-3 for BA3-motif, 10^-2 for MNIST, and 10^-3 for MUTAG. Also, we employ early stopping to avoid overfitting the training dataset: if the model's performance on the validation dataset shows no improvement (i.e., validation accuracy begins to decrease) for five epochs, we stop the training process to prevent increased generalization error.

Hyperparameter Settings. The most crucial hyperparameters in this work are the coefficients of regularization. For MUTAG, the initial coefficient of the sparsity constraint is 0.05 and it grows at a rate of 50% each epoch; the coefficient of the discreteness constraint is 0.5 and it grows at a rate of 50% each epoch. For BA3, these coefficients are set to {0.04, 50%, 0.4, 50%}; for MNIST, they are set to {5 × 10^-5, 100%, 5 × 10^-4, 50%}. For the other hyperparameters in the baseline methods, we adopt a grid search for the optimal parameters using the validation datasets. To be more specific, the learning rate of Adam is tuned in {10^-3, 10^-2, 10^-1}, and the weight decay is searched in {10^-5, 10^-4, 10^-3}. Other model-specific hyperparameters are set as follows: for PGExplainer, the temperature for reparameterization is 0.1; for Refine, the temperature hyperparameter β is 1 and the trade-off hyperparameter γ is 5; for GSAT, the parameter of the Bernoulli distribution r is fixed at 0.6.

C QUALITATIVE EVALUATION OF SRS

To visually inspect the explanatory subgraphs generated by different explainers and the effectiveness of our proposed scheme SRS, we randomly choose graph instances from the synthetic dataset BA3-motif and present them in Figure 6. For each baseline explainer, we highlight the edges with the top-K importance scores by red lines, where K = 6. The ground-truth nodes are highlighted in green, while the turbulence nodes w.r.t. the nodes in BA-motif are distinguished in blue.
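Our reading of the coefficient schedule above (compounded multiplicative growth per epoch) can be sketched as follows; the function name is hypothetical and the compounding interpretation is an assumption:

```python
def reg_coefficient(init, growth_rate, epoch):
    """Regularization coefficient at a given epoch when it starts at `init`
    and grows at `growth_rate` per epoch (0.5 means +50% each epoch).
    Compounded multiplicative growth is our reading of the schedule."""
    return init * (1.0 + growth_rate) ** epoch
```

Under this reading, the MUTAG sparsity coefficient starts at 0.05 and reaches 0.05 × 1.5^2 = 0.1125 after two epochs.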
According to Figure 6, we can observe that:

• Some blue edges that do not belong to the ground-truth motifs are selected by the baseline explanation methods. In contrast, the explainers enhanced by our proposed scheme SRS inherit the ground-truth edges extracted by the original methods while eliminating the edges that belong to the Barabasi-Albert (BA) graphs.
• The explainers enhanced by SRS take account of both accuracy and completeness. That is, SRS mainly focuses on the edges belonging to one complete motif when the graph contains more than one ground-truth motif, whereas the baseline explainers are often distracted by multiple ground-truth motifs.
• Among the blue nodes in the Barabasi-Albert (BA) graphs, several intrusive nodes are connected to the ground-truth motifs. These nodes cause higher interference in the generation of explanatory subgraphs. Fortunately, the SRS framework avoids these traps by extracting fewer turbulence nodes. This phenomenon demonstrates the robustness of our proposed SRS.
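The top-K edge selection used to render such visualizations can be sketched as follows (assuming an edge list with per-edge importance scores; all names are illustrative, not the paper's code):

```python
import numpy as np

def top_k_edges(edge_index, scores, k=6):
    """Return the k edges with the largest importance scores.
    edge_index: (2, E) array of (source, target) node pairs.
    scores:     (E,) array of edge importance scores."""
    order = np.argsort(scores)[::-1][:k]
    return edge_index[:, order]
```

With K = 6, as in Figure 6, the returned edges are the ones an explainer would highlight in red.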



REGULARIZATION IN TWO PHASES

Regularization in feature attribution. Directly computing the training objective of feature attribution, I(G ⊙ M; Y), is difficult since the latent distribution P(G ⊙ M, Y) is notoriously intractable. Hence, existing explainers typically introduce a parameterized variational approximation P_θ(Y | G ⊙ M) for P(Y | G ⊙ M) according to:

4.1 REGULARIZATION & SPARSITY

We first focus on the relationship between regularization and the predefined sparsity η of the targeted subgraph G*, where η = 1 - |G*|/|G|. According to Equation 2, the sparsity variation of G* only affects the objective of feature selection,

Figure 3: Rationale of stochastic mechanism in explainability: an implicit regularization term.

Figure 4: The performance of baseline explainers averaged across 10 runs for different sparsity, while the coefficient of regularization is changed. Best viewed in color.



Figure 6: Selected explanations in BA3-motif, where the top-6 directed edges are highlighted by red lines. The ground-truth nodes are highlighted in green, while the turbulence nodes are distinguished in blue. Best viewed in color.



Table 3: Statistics of the datasets and configurations of the GNN models.

ETHICS STATEMENT

In this work, we rethink the role of regularization in GNN explainability from the perspective of information theory; no human subjects are involved. We believe that digging deeper into the theory of regularization is beneficial for producing better explanations and, consequently, for improving model transparency in real-world applications.

REPRODUCIBILITY

We summarize the efforts made to ensure reproducibility in this work. (1

