DAG MATTERS! GFLOWNETS ENHANCED EXPLAINER FOR GRAPH NEURAL NETWORKS

Abstract

Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over the years. Existing literature mainly focuses on selecting a subgraph, through combinatorial optimization, to provide faithful explanations. However, the exponential size of candidate subgraphs limits the applicability of state-of-the-art methods to large-scale GNNs. We improve on this through a different approach: by proposing a generative structure, the GFlowNets-based GNN Explainer (GFlowExplainer), we turn the optimization problem into a step-by-step generative problem. Our GFlowExplainer aims to learn a policy that generates a distribution of subgraphs for which the probability of a subgraph is proportional to its reward. The proposed approach eliminates the influence of node sequence and thus does not need any pre-training strategies. We also propose a new cut vertex matrix to efficiently explore parent states for the GFlowNets structure, thus making our approach applicable in a large-scale setting. We conduct extensive experiments on both synthetic and real datasets, and both qualitative and quantitative results show the superiority of our GFlowExplainer.

1. INTRODUCTION

Graph Neural Networks (GNNs) have received widespread attention due to the proliferation of graph-structured data in real-world applications, such as social networks and chemical molecules Zhang et al. (2020). Various graph-related tasks are widely studied, including node classification Henaff et al. (2015); Liu et al. (2020) and graph classification Zhang et al. (2018). However, uncovering the rationales behind GNN predictions is relatively less explored. Recently, some explanation approaches for GNNs have gradually gained attention. They fall into two major branches: instance-level explanations and model-level explanations Yuan et al. (2022). In this paper, we mainly focus on instance-level explanations. Instance-level approaches explain models by identifying the most critical input features for their predictions. They have four sub-branches: gradients/features-based Zhou et al. (2016); Baldassarre & Azizpour (2019); Pope et al. (2019), perturbation-based Ying et al. (2019); Luo et al. (2020); Schlichtkrull et al. (2020); Wang et al. (2020), decomposition-based Baldassarre & Azizpour (2019); Schnake et al. (2020); Feng et al. (2021) and surrogate-based Vu & Thai (2020); Huang et al. (2022); Yuan et al. (2022). Some works, such as XGNN Yuan et al. (2020) and RGExplainer Shan et al. (2021), apply reinforcement learning (RL) to model-level and instance-level explanations.

However, these pioneering works have some drawbacks. Perturbation-based approaches return discrete edges as explanations, which are not as intuitive as graph generation-based approaches that provide connected graphs. Yet searching for connected subgraphs is a combinatorial problem, and the number of candidates increases exponentially, making most current approaches inefficient and intractable in large-scale settings. In addition, current research considers Monte-Carlo tree search, which has high variance and ignores the fact that a graph is an unordered set. This can lead to a loss of sampling efficiency and effectiveness, i.e., the approaches fail to consolidate information from sampled trajectories that form the same subgraph with different sequences.

To address the above issues, we take advantage of the strong generation property of Generative Flow Networks (GFlowNets) Bengio et al. (2021b) and cast the combinatorial optimization problem as a generation problem. Unlike previous work, which focuses on the maximization of mutual information, our insight is to learn a generative policy that generates a distribution of connected subgraphs with probabilities proportional to their mutual information. We call this approach GFlowExplainer, which can overcome the current predicament for the following reasons. First, it has a stronger exploration ability due to its flow matching condition, helping us to avoid the trap of suboptimal solutions. Second, in contrast to previous tree search or node sequence modeling, GFlowExplainer consolidates information from sampled trajectories generating the same subgraph with different sequences. This critical difference can largely increase the utilization of generated samples, and hence improve the performance. Moreover, by introducing a cut vertex matrix, GFlowExplainer can be applied in large-scale settings and achieve better performance with fewer training epochs. We summarize the main contributions as follows.

Main Contributions:
1) We propose a new hand-crafted method for GNN explanation via the GFlowNets framework to sample from a target distribution with the energy proportional to the predefined score function;
2) We take advantage of the DAG structure in GFlowNets to connect the trajectories that output the same graph with different node sequences. Therefore, without any pre-training strategies, we can significantly improve the effectiveness of our GNN explanations;
3) Since exploring valid parent states in GFlowNets is relatively cumbersome because of the connectivity constraint of the graph, we introduce the concept of a cut vertex and propose a more efficient cut vertex criterion for dynamic graphs, thus speeding up the whole process;
4) We conduct extensive experiments to show that GFlowExplainer can outperform current state-of-the-art approaches.

2. RELATED WORK

Graph Neural Networks: Graph neural networks (GNNs) have developed rapidly in recent years and have been adopted to leverage the structure and properties of graphs Scarselli et al. (2008); Sanchez-Lengeling et al. (2021). Most GNN variants can be summarized with the message passing scheme, which is composed of pattern extraction and interaction modeling within each layer Gilmer et al. (2017). These approaches aggregate the information from neighbors with different functions, such as mean/max/LSTM-pooling in GCN Welling & Kipf (2016) and GraphSAGE Hamilton et al. (2017), sum-pooling in GIN Xu et al. (2018), and attention mechanisms in GAT Velickovic et al. (2017). SGC Wu et al. (2019) observes that the superior performance of GNNs is mainly due to neighbor aggregation rather than feature transformation and nonlinearity, and proposes a simple and fast GNN model. APPNP Klicpera et al. (2018) shares a similar idea by decoupling feature transformation and neighbor aggregation.

Generative Flow Networks: Generative flow networks Bengio et al. (2021a; b) aim to train generative policies that sample compositional objects x ∈ D through discrete action sequences with probability proportional to a given reward function. Because such a network samples trajectories according to a distribution proportional to the rewards, it is particularly useful when exploration is important. The approach also differs from RL, which aims to maximize the expected return and only generates a single sequence of actions with the highest reward. GFlowNets have been applied in molecule generation Bengio et al. (2021a); Jain et al. (2022), discrete probabilistic modeling Zhang et al. (2022), Bayesian structure learning Deleu et al. (2022), causal discovery Li et al. (2022) and continuous control tasks Li et al. (2023).

Instance-level GNN Explanation: Instance-level approaches explain models by identifying the most critical input features for their predictions. Gradients/features-based approaches, e.g., Zhou et al. (2016); Baldassarre & Azizpour (2019); Pope et al. (2019), compute gradients or map features back to the input to identify the important terms, but the resulting scores sometimes do not reflect the contributions intuitively. As for the perturbation-based approaches, GNNExplainer Ying et al. (2019) is the first method designed specifically for explaining GNNs; it formulates an optimization task to maximize the mutual information between the GNN predictions and the distribution of potential subgraphs. Unfortunately, GNNExplainer and Causal Screening Wang et al. (2020) may lack a global view of explanations and get stuck at local optima. Even though PGExplainer Luo et al. (2020) and GraphMask Schlichtkrull et al. (2020) can provide some global insights, they require a reparameterization trick and cannot guarantee that the output subgraphs are connected, and thus fail to reflect the message passing scheme in GNNs. Shapley-value based approaches SubgraphX Yuan et al. (2021) and GraphSVX Duval & Malliaros (2021) are computationally expensive, especially when exploring different subgraphs with the MCTS algorithm. Decomposition-based approaches, for example, LRP Baldassarre & Azizpour (2019), GNN-LRP Schnake et al. (2020) and DEGREE Feng et al. (2021), evaluate the importance of input features by decomposing the model predictions into several terms, at the price of making the method harder to apply to complex and structured graph datasets. Surrogate-based approaches PGMExplainer Vu & Thai (2020) and GraphLime Huang et al. (2022) sample a data set from the neighbors of a given example and then fit an interpretable model to that data set. However, this approach requires a careful definition of neighboring areas, making generalization to other problem settings highly non-trivial. As another attempt, XGNN Yuan et al. (2020) and RGExplainer Shan et al. (2021) apply reinforcement learning to model-level and instance-level explanations respectively, while the latter requires inefficient pre-training strategies and has high sampling variance.

3.1. PROBLEM FORMULATION

Let $G = (V, E)$ denote a graph on nodes $V$ and edges $E$ with $d$-dimensional node features $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$. The adjacency matrix $A$ describes the edge relationships of $G$, i.e., $A_{ii} = 1$ for all $v_i \in V$ and $A_{ij} = 1$ for all $\{v_i, v_j\} \in E$. $\hat{A}$ is the symmetrically normalized adjacency matrix computed by $\hat{A} = D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix of $A$. Let $\Phi$ denote a trained GNN model, which is optimized on all instances in the training set and is then used for predictions. Given an instance, i.e., a node $v$ or a graph $G$, the goal of GNN explanation is to identify a subgraph $G_s = (V_s, E_s)$ and the associated features $X_s = \{x_j \mid v_j \in G_s\}$ that are important for the GNN prediction $Y_i = \Phi(v_i)$ or $Y_{g_i} = \Phi(G_i)$, where $g_i$ is a graph instance. Previous works formulate this task as an optimization problem whose objective is to maximize the mutual information

$$\max_{G_s} MI(Y, G_s) = H(Y) - H(Y \mid G_s) \iff \min_{G_s} H(Y \mid G_s),$$

where $MI(\cdot)$ is the mutual information function, $H(\cdot)$ is the entropy function, $\hat{y}$ is the prediction of $\Phi$ with $G_s$ as the input, and $H(Y \mid G_s) = -\mathbb{E}_{Y \mid G_s}[\log P_\Phi(Y \mid G_s)]$. Since $H(Y)$ is fixed in the explanation stage, the objective can be rewritten as $\min_{G_s} H(Y \mid G_s)$, which minimizes the uncertainty of $\Phi$ when the GNN computation is limited to $G_s$.

From the graph generation perspective, since there are exponentially many candidates for explaining $\hat{y}$, it is not trivial to directly solve such a combinatorial optimization problem. We therefore turn this optimization problem into a step-by-step generative problem (see Figure 1). We propose our generative structure as the GFlowNets-based GNN Explainer, abbreviated as GFlowExplainer, which consists of a tuple $(\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is the action set consisting of transitions $a_t : s_t \to s_{t+1}$. The insight is that we can consider $G_s$ as a compositional object. Starting from an empty graph, we can train our policy network to generate such a $G_s$ by sequentially adding one neighbor node at each step $t$, which ensures the connectivity of the explanation graph. Here $G_s(s_t)$ refers to the subgraph at state $s_t$, and adding one node corresponds to an action $a_t$ making a state transition $s_t \to s_{t+1}$. Different from traditional optimization problems that maximize the mutual information, our objective is to construct a TD-like flow matching condition to obtain a generative forward policy $\pi(a_t \mid s_t)$ such that $P(Y, G_s) \propto r(Y, G_s)$, where $r(Y, G_s)$ is a predefined reward function based on $MI(Y, G_s)$.

The rest of the section is organized as follows: we first introduce the flow modeling of GFlowNets in Section 3.2. The crucial elements of the GFlowExplainer structure are defined in Sections 3.3 and 3.4. We propose a new framework to address the connectivity problem for an effective exploration of parent states in GFlowExplainer in Section 3.5. Either the outflows $F(s_{t+1} \to S_c)$ or the reward $r(s_f)$ is calculated based on the stopping criteria.
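To make the step-by-step view concrete, the following Python sketch enumerates every way of growing a connected subgraph from a starting node by adding one boundary node at a time. The toy graph `ADJ`, the helper names, and the size cap `max_nodes` are illustrative assumptions, not part of the paper. It shows that several node orderings can reach the same subgraph, which is the redundancy a DAG over states consolidates.

```python
# Hypothetical toy graph (not from the paper): undirected adjacency sets.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}


def graph_neighbors(state):
    """Boundary nodes: outside `state` but adjacent to some node inside it."""
    return {u for v in state for u in ADJ[v]} - state


def trajectories(v0, max_nodes):
    """All node orderings that grow a connected subgraph starting from v0."""
    found = []
    stack = [(v0,)]
    while stack:
        seq = stack.pop()
        found.append(seq)
        if len(seq) < max_nodes:
            # Each action adds exactly one boundary node, so connectivity holds.
            for u in sorted(graph_neighbors(frozenset(seq))):
                stack.append(seq + (u,))
    return found


seqs = trajectories(0, max_nodes=3)
states = {frozenset(s) for s in seqs}
print(len(seqs), "orderings reach", len(states), "distinct subgraphs")
```

On this toy graph the subgraph {0, 1, 2} is reached by two different orderings, (0, 1, 2) and (0, 2, 1). A sequence-based model treats them as unrelated samples, whereas a state defined as a node set merges them.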

3.2. FLOW MODELING

Flows and Probability Measures: Following Bengio et al. (2021b), consider a directed acyclic graph (DAG) $G = (\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a subset of $\mathcal{S} \times \mathcal{S}$ representing directed edges, and each element of $\mathcal{A}$ corresponds to a state transition $a_t : s_t \to s_{t+1}$. A complete trajectory is a sequence of states $\tau = (s_0, \dots, s_n, s_f) \in \mathcal{T}$, where $s_0$ is an initial state, $s_f$ is a terminal state and $\mathcal{T}$ is the set of all complete trajectories. In order to measure the probabilities associated with states $s$, a non-negative function $F(\cdot)$, called the "flow", is introduced. $F(s_t, a_t) = F(s_t \to s_{t+1})$ corresponds to an edge flow or action flow. $F(\tau)$ corresponds to the trajectory flow, and the state flow is the sum of all trajectory flows passing through that state, denoted as $F(s) = \sum_{\tau : s \in \tau} F(\tau)$. If we denote the total flow of the DAG by $Z$, fix the flow into each terminal state $s_f$ to the given value $r(s_f)$, and view the DAG as a water pipe in which water enters at $s_0$ and flows out through all $s_f$, we obtain $Z = F(s_0) = \sum_{s_f} F(s_f) = \sum_{s_f} r(s_f)$. Based on this flow network, the stochastic policy $\pi$ associated with the normalized flow probability $P$ is defined as

$$\pi(a_t \mid s_t) = P_F(s_{t+1} \mid s_t) = P(s_t \to s_{t+1} \mid s_t) = \frac{F(s_t \to s_{t+1})}{F(s_t)},$$

where $P_F(s_{t+1} \mid s_t)$ is called the forward transition probability. Then we obtain $P_F(\tau) = \prod_{t=0}^{f-1} P_F(s_{t+1} \mid s_t)$, which yields

$$P_F(s) = \sum_{\tau : s \in \tau} P_F(\tau) = \frac{\sum_{\tau \in \mathcal{T}} \mathbb{I}_{s \in \tau} F(\tau)}{\sum_{\tau \in \mathcal{T}} F(\tau)} = \frac{F(s)}{Z}.$$

Our goal is to obtain $P_F(s_f) = \sum_{\tau : s_f \in \tau} P_F(\tau) \propto r(s_f)$.

State-Conditional Flow Network: Following Bengio et al. (2021b), consider a flow network based on a DAG $G = (\mathcal{S}, \mathcal{A})$ and a non-negative flow function $F(\cdot)$. For each state $s \in \mathcal{S}$, the subgraph $G_s$ consists of all $s'$ such that $s' \geq s$, where $\geq$ denotes the partial order on states. The state-conditional flow network is then based on the family $\{G_s, s \in \mathcal{S}\}$ with a conditional flow function $F : \mathcal{S} \times \mathcal{T} \to \mathbb{R}^+$, in which $\mathcal{T} = \cup_{s \in \mathcal{S}} \mathcal{T}_s$ and $\mathcal{T}_s$ is the set of trajectories in $G_s$ containing all $\{\tau = (s, \dots, s_f)\}$ such that $\forall s_n, s_m \in \tau$, $F_s(s_n \to s_m) = F(s_n \to s_m)$. Under this definition, the initial flow of the state-conditional flow network marginalizes the terminating flows $F(s' \to s_f)$ over all terminating states $s' \geq s$ Bengio et al. (2021b):

$$F_s(s_0 \mid s) := F_s(s) = \sum_{s' : s' \geq s} F(s' \to s_f).$$

We can then obtain the corresponding probability measure as

$$P_s(s' \mid s) = \frac{F_s(s' \to s_f)}{F_s(s)}, \quad \forall s' \geq s.$$

The flow $F(s)$ through state $s$ in the original flow network cannot provide the marginalization over the downstream terminating flows; we therefore introduce this state-conditional flow network, which satisfies the desired marginalization property. In our task, considering the message passing scheme, we start sampling each trajectory from a chosen starting node $v_0$, so the transition from an empty graph to that node $v_0$ is ignored here. We can consider that each trajectory is sampled from the subgraph family $\{G_{s_0}\}$, where $s_0 = v_0$. Then, similar to the way a GFlowNet estimates the flow of a flow network, based on the state-conditional flow network we can still train a policy to obtain $P_{s_0}(s_f) \propto r(s_f)$. Based on equation 4, we can omit this subscript in the following sections.
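As a small worked illustration of the definitions above, the Python sketch below builds a toy DAG with hand-picked edge flows, derives the forward policy as edge flow divided by state flow, and checks that the resulting terminal distribution is proportional to the rewards. The states, flows and rewards are made up for the example, and terminal states are treated as sinks, which is a simplification of the formalism.

```python
from fractions import Fraction

# Hypothetical edge flows on a toy DAG s0 -> {a, b} -> {x, y}; x and y are terminal.
EDGE_FLOW = {("s0", "a"): 1, ("s0", "b"): 3, ("a", "x"): 1, ("b", "x"): 2, ("b", "y"): 1}
REWARD = {"x": 3, "y": 1}  # assumed rewards of the terminal states


def outflow(state):
    """Sum of edge flows leaving `state`."""
    return sum(f for (u, _), f in EDGE_FLOW.items() if u == state)


def inflow(state):
    """Sum of edge flows entering `state`."""
    return sum(f for (_, v), f in EDGE_FLOW.items() if v == state)


def policy(state):
    """Forward policy: F(s -> s') / F(s), with F(s) taken as the outflow of s."""
    total = outflow(state)
    return {v: Fraction(f, total) for (u, v), f in EDGE_FLOW.items() if u == state}


def terminal_probs(state="s0", mass=Fraction(1)):
    """Probability of ending in each terminal state when following the policy."""
    if state in REWARD:
        return {state: mass}
    probs = {}
    for nxt, p in policy(state).items():
        for term, m in terminal_probs(nxt, mass * p).items():
            probs[term] = probs.get(term, Fraction(0)) + m
    return probs


Z = outflow("s0")
probs = terminal_probs()
print(Z, probs)
```

Here the flows satisfy inflow equals outflow at the interior states and inflow equals reward at the terminal states, so the sampled terminal probabilities are exactly the rewards divided by Z.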

3.3. STATES AND ACTIONS

In this subsection, we define states and actions, as well as node neighbors and graph neighbors, which are needed for the valid action set $\mathcal{A}$ and for the valid parent states in Section 3.5.

Definition 1 (State) A state $s_t \in \mathcal{S}$ in GFlowExplainer refers to a subgraph $G_s(s_t)$ consisting of several nodes. The initial state $s_0$ contains a starting node $v_0$, and a final state $s_f$ is a subgraph that meets the stop criteria.

Since we need to guarantee the connectivity of the generated subgraph $G_s$, at every step we can only select a node from the boundary of the current subgraph $G_s(s_t)$.

Definition 2 (Neighbors) There are two types of neighbors: node neighbors and graph neighbors. $\forall v_i, v_j \in G_s$, if $\{v_i, v_j\} \in E$, then we define $v_i$ as a neighbor of node $v_j$, denoted as $v_i \in N(v_j)$, and vice versa; $\forall v_i \notin G_s$, if $\exists v_j \in G_s$ such that $\{v_i, v_j\} \in E$, then we define $v_i$ as a neighbor of graph $G_s(s_t)$, denoted as $v_i \in N(s_t)$.

Put simply, graph neighbors are the boundary nodes that have not yet been selected into the subgraph, while node neighbors describe which pairs of nodes inside the subgraph are connected.

Definition 3 (Action) An action $a_t : s_t \to s_{t+1} \in \mathcal{A}$ in GFlowExplainer adds a node from $N(s_t)$, denoted as $a_t : \{v_i\}^+ \sim N(s_t)$, thus making a state transition $s_{t+1} = s_t \cup \{v_i\}$.

Since we need to combine the features of all nodes in $N(s_t)$ and $G_s(s_t)$ as the input to calculate the action distribution, for each node $v_i \in N(s_t) \cup G_s(s_t)$ we concatenate two indicator functions with its original feature vector $x_i$, to distinguish the initial node $v_0$ and all nodes in the subgraph $G_s(s_t)$. The insights behind this are: 1) for the node classification task, the same generated subgraph should have different scores depending on the specific node to be explained; 2) the allowed action is to select a node in $N(s_t)$ rather than in $G_s(s_t)$, since selecting a node already in $G_s(s_t)$ would introduce cycles into our DAG structure. Therefore, the initial feature representation $X'_t$ is obtained as follows:

$$x'_i = [x_i, \mathbf{1}_{v_i = v_0}, \mathbf{1}_{\{v_i \in G_s(s_t)\}}], \quad X'_t = [x'_i]_{\forall v_i \in G_s(s_t) \cup N(s_t)}.$$

Considering the associations among nodes in graph-structured data, it is crucial for each node $v_i$ to combine information from its neighbors. To achieve this, we apply APPNP, a GNN method proposed by Klicpera et al. (2018), which separates the non-linear transformation from the information propagation. We have the following update equations:

$$H^{(0)}_t = \Theta_1 X'_t, \quad H^{(l+1)}_t = (1 - \alpha) \hat{A} H^{(l)}_t + \alpha H^{(0)}_t,$$

where $\Theta_1$ is the trainable weight matrix and $\alpha$ is a hyper-parameter that controls the weight of the initial representation. After $L$ layers of updates, we obtain the node representations $H^L_t$, and then feed them into an MLP to improve the representation ability:

$$H_t(v_i) = \mathrm{MLP}(H^L_t(v_i); \Theta_2), \quad v_i \in G_s(s_t) \cup N(s_t),$$

where $\Theta_2$ denotes the learnable parameters of the MLP.
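The feature augmentation and the propagation rule can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the two-node graph and feature values are invented, the weight matrix Θ₁ is taken to be the identity so that the initial representation equals the augmented features, the adjacency already contains self-loops, and the final MLP is omitted.

```python
import math


def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; `adj` already has self-loops."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def augment(features, nodes, v0, in_subgraph):
    """x'_i = [x_i, 1[v_i = v0], 1[v_i in G_s(s_t)]] for each listed node."""
    return [features[v] + [float(v == v0), float(v in in_subgraph)] for v in nodes]


def appnp(a_hat, h0, alpha, num_layers):
    """H^(l+1) = (1 - alpha) * A_hat @ H^(l) + alpha * H^(0)."""
    h = h0
    for _ in range(num_layers):
        prop = matmul(a_hat, h)
        h = [[(1 - alpha) * p + alpha * z for p, z in zip(p_row, z_row)]
             for p_row, z_row in zip(prop, h0)]
    return h


# Toy state: subgraph {0} with boundary node 1 (values are illustrative only).
features = {0: [1.0], 1: [2.0]}
x_aug = augment(features, nodes=[0, 1], v0=0, in_subgraph={0})
a_hat = normalize_adj([[1, 1], [1, 1]])
h1 = appnp(a_hat, x_aug, alpha=0.1, num_layers=1)
print(x_aug, h1)
```

With a larger α each node keeps more of its own augmented features, so the two indicator entries stay informative even after several propagation steps.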

3.4. REWARD, STARTING NODE AND STOP CRITERIA

Similarly to Luo et al. (2020), we use the cross-entropy function to replace the conditional entropy function $H(Y \mid s_f)$ with $N$ given instances, and define the reward function as follows:

$$r(s_f, Y) = \exp(-\mathcal{L}(s_f, Y)) = \exp\Big(-\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} P(Y = c) \log P(\hat{y} = c)\Big),$$

where $\mathcal{L}$ is the prediction loss; $s_f$ is the generated explanatory subgraph for an instance; $C$ is the number of possible predicted labels; $P(\hat{y} = c)$ is the probability that the original prediction of the trained GNN $\Phi$ is $c$; and $P(Y = c)$ is the probability that the label prediction of $\Phi$ on the subgraph $s_f$ is $c$. We use the exponential to avoid negative rewards.

For node classification tasks, the starting node is the node instance to be interpreted. In contrast, for graph classification tasks, any node could be the starting node, and its choice determines the explanation performance. Therefore, we construct a locator $L$ to identify the most influential node in the graph, similarly to Shan et al. (2021). Given $N$ graph instances $g_n$, the prediction entering the classification loss can be rewritten with the locator $L$ as $\hat{y} = \Phi(\pi(s_f \mid L(g_n)))$. We train a three-layer MLP to model the influence of a node $v_{i,n}$ on the label of the graph instance $g_n$:

$$\omega_{i,n} = \mathrm{MLP}([z_{g_n}, z_{v_{i,n}}]),$$

where $z_{g_n}$ and $z_{v_{i,n}}$ are respectively the feature representations of the graph $g_n$ and the node $v_{i,n}$ after three GCN layers based on the trained model $\Phi(\cdot)$. We train this neural network on sampled graph instances with the Kullback-Leibler divergence loss $\mathrm{KLDivLoss}(\omega_{i,n}, -\mathcal{L}(\pi(s_f \mid v_{i,n}), Y_{g_n}))$, so that the distribution of the estimated values $\omega_{i,n}$ asymptotically approaches that of $-\mathcal{L}(\pi(s_f \mid v_{i,n}), Y_{g_n})$; softmax layers are used to transform these two sets of values into distributions.

To obtain a compact explanation and avoid generating large subgraphs, we impose a constraint $|s_f| \leq K_M$ so that $s_f$ has at most $K_M$ nodes. We also introduce a self-attention mechanism, similarly to Shan et al. (2021), which aggregates the feature representations:

$$\gamma_t(v_i) = \frac{\exp(\theta_1^T H_t(v_i))}{\sum_{v_j \in N(s_t)} \exp(\theta_1^T H_t(v_j))}, \quad v_i \in N(s_t), \qquad H_t(\mathrm{STOP}) = \sum_{v_i \in G_s(s_t) \cup N(s_t)} \gamma_t(v_i) H_t(v_i),$$

where the parameter $\theta_1$ learns the attention $\gamma_t(v_i)$ for each node $v_i$. We can concatenate $H_t(\mathrm{STOP})$ into the feature representations in equation 9. Note that all learnable parameters above are components of our policy network.
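A minimal Python sketch of the reward computation follows. It assumes the loss is the usual non-negative cross-entropy between the subgraph prediction and the original prediction, so that the reward lies in (0, 1]; this sign convention, the function names and the probability vectors are assumptions made for illustration.

```python
import math


def cross_entropy(p_sub, p_orig):
    """Mean over instances of -sum_c P(Y=c) * log P(yhat=c).

    p_sub[n][c]  : prediction of the GNN on the explanatory subgraph (P(Y=c)).
    p_orig[n][c] : original prediction of the GNN (P(yhat=c)).
    Classes with zero weight are skipped so that log(0) is never evaluated for them.
    """
    total = 0.0
    for sub_row, orig_row in zip(p_sub, p_orig):
        total -= sum(a * math.log(b) for a, b in zip(sub_row, orig_row) if a > 0)
    return total / len(p_sub)


def reward(p_sub, p_orig):
    """r = exp(-loss): positive, and equal to 1 when the loss is zero."""
    return math.exp(-cross_entropy(p_sub, p_orig))


def within_size_limit(state, k_max):
    """Size constraint |s_f| <= K_M on the explanatory subgraph."""
    return len(state) <= k_max


r_mismatch = reward([[1.0, 0.0]], [[0.5, 0.5]])
r_match = reward([[1.0, 0.0]], [[1.0, 0.0]])
print(r_mismatch, r_match)
```

Under this convention a subgraph whose prediction agrees with a confident original prediction receives a reward close to 1, while disagreement pushes the reward towards 0.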

3.5. EFFICIENT PARENT STATE EXPLORATIONS

The flow matching condition is a crucial element in flow modeling. For the current state $s_t$, we need to explore all its direct parent states and the corresponding one-step actions, i.e., all pairs $(s, a)$ with $T(s, a) = s_t$, which are the state-action pairs that lead to $s_t$. However, the connectivity constraint makes exploring valid parents non-trivial, since we need to guarantee that the graph is always connected. We treat this task as a cut vertex exploration problem, which aims to find, for each $s_t$, all vertices whose removal breaks the connectivity of the graph. If a node is a cut vertex, we cannot obtain a valid parent state by deleting it. By taking advantage of the step-by-step generative process, we can update and store the cut vertices without repeated checking. Based on Definitions 4 and 5, Theorem 1 below shows how to update the cut vertices at each step; it is proved in Appendix A.2.

Definition 4 (Cut vertex matrix) A cut vertex matrix of a state $s_t$ is a dynamic matrix $Z \in \mathbb{R}^{t \times t}$, where $Z_{i,i} = 0$, and $v_i$ is said to be a cut vertex at $s_t$ if $\exists j$ such that $Z_{i,j}(s_t) \neq 0$.

Definition 5 (Connectivity vector) Suppose an action $a_t = \{v_j\}^+$. A connectivity vector of state $s_t$ is a binary vector $z \in \{0, 1\}^{t \times 1}$ such that, $\forall v_i \in G_s(s_t)$, $z_i(s_t) = 1$ if $\{v_i, v_j\} \in E$.

Lemma 1 Suppose an action $a_t = \{v_j\}^+$ and $v_i \in G_s(s_t)$. If $v_i$ is not a cut vertex at $s_t$, $v_i$ becomes a cut vertex from $s_{t+1}$ iff $|N(a_t)| = 1$ and $\{v_i, v_j\} \in E$, where $t > 1$. If $v_i$ is a cut vertex at $s_t$, $v_i$ is not a cut vertex from $s_{t+1}$ iff $|N(a_t)| > 1$ and $a_t$ connects to all "child groups" of $v_i$.

Theorem 1 Starting from $Z(s_t) = [0]_{2 \times 2}$, for any action $a_t = \{v_j\}^+ \in \mathcal{A}$, $t \geq 2$, the connectivity vector of state $s_t$ is constructed by $z_k(a_t) = A_{j,k}$, $\forall v_k \in G_s(s_t)$. We then update the cut vertex matrix $Z(s_{t+1})$ by

$$Z(s_{t+1}) = \begin{bmatrix} Z'(s_t) & z'(a_t) \\ 0 & 0 \end{bmatrix},$$

where $Z'(s_t)$, $z'(s_t)$ are constructed based on the following equations:

1. If $|N(a_t)| = 1$, $v_k \in G_s(s_t)$, $\{v_k, v_j\} \in E$: $\forall v_m \in G_s(s_t)$,

$$Z'(s_t) = (1 - I_{t \times t}) \wedge \{Z(s_t) + \mathbb{I}_{Z_k(s_t)[1 - z(a_t)] = 0}\, z(a_t) \cdot [1]_{1 \times t}\},$$
$$z'(a_t) = Z'(s_t) \cdot z(a_t) + z(a_t) \wedge [\max\{Z'_m(s_t)\} + 1]_{t \times 1}. \tag{16}$$

2. If $|N(a_t)| > 1$, $k = 0, \dots, t$: $\forall v_m \in G_s(s_t)$,

$$Z'_{m,k}(s_t) = \mathbb{I}_{set}\,[\mathbb{I}_{set2} \max\{Z_m(s_t) \wedge z^T(a_t)\} + (1 - \mathbb{I}_{set2}) Z_{m,k}(s_t)],$$
$$z'_m(a_t) = \mathbb{I}_{sum}\,[\max\{Z'_m(s_t) \wedge z^T(a_t)\}],$$

where $\mathbb{I}_{set} = 1$ iff $set(Z_m(s_t) \wedge z^T(s_t)) \neq set(Z_m(s_t))$, $\mathbb{I}_{sum} = 1$ iff $Z'_m(s_t) z(a_t) \neq 0$, and $\mathbb{I}_{set2} = 1$ iff $Z_{m,k}(s_t) \in set(Z_m(s_t) \wedge z^T(a_t))$. Here $set(\cdot)$ denotes the distinct values (except 0) in a vector.

Then, based on Lemma 1, we have the following criterion to ensure valid parent exploration in GFlowNets: for $t \geq 2$, $v_i$ is a cut vertex at $s_t$ iff $\exists j$ such that $Z_{i,j}(s_t) \neq 0$.

Theorem 1 shows how we exploit the dynamic graph to efficiently update the cut vertices and thus guarantee valid parent explorations. This approach is a kind of "amortized" checking, since we only need to consider the additional edges from $a_t$ instead of all edges and nodes in $G_s(s_t)$, and thus it has lower complexity than previous approaches. The theoretical analysis is given in Appendix D.3.
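For reference, the task that the cut vertex matrix amortizes can be stated as a brute-force procedure: delete each node in turn and re-check connectivity from scratch. The Python sketch below implements this naive baseline, not the matrix update of Theorem 1. The toy edge list is invented, and excluding the starting node `v0` from deletion is an assumption based on every state containing the starting node.

```python
from collections import deque

# Hypothetical toy graph: node 1 separates node 0 from the triangle 1-2-3.
EDGES = [(0, 1), (1, 2), (2, 3), (1, 3)]


def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_connected(nodes, adj):
    """Breadth-first search restricted to `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u in nodes and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen == nodes


def cut_vertices(state, adj):
    """Nodes whose removal disconnects the subgraph induced by `state`."""
    return {v for v in state if not is_connected(state - {v}, adj)}


def valid_parents(state, v0, adj):
    """(parent_state, added_node) pairs: delete any node that is neither v0 nor a cut vertex."""
    cuts = cut_vertices(state, adj)
    return [(state - {v}, v) for v in sorted(state) if v != v0 and v not in cuts]


ADJ = build_adj(EDGES)
state = frozenset({0, 1, 2, 3})
print(cut_vertices(state, ADJ), valid_parents(state, 0, ADJ))
```

Each call re-runs a search over the whole subgraph for every candidate node, which is the repeated work that storing and incrementally updating cut vertex information is meant to avoid.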

3.6. TRAINING PROCEDURE

Starting from the starting node, GFlowExplainer draws complete trajectories $\tau = (s_0, s_1, \dots, s_f) \in \mathcal{T}$ by iteratively sampling $\{v_i\}^+ \sim \pi(a_t \mid s_t)$ until the stopping criteria are met. After sampling a buffer, to train the policy $\pi(a_t \mid s_t)$ which satisfies $P(s_f, Y) \propto r(s_f, Y)$, we minimize the loss over the flow matching condition as follows:

$$\mathcal{L}(\tau) = \sum_{s_{t+1} \in \tau} \Big( \sum_{T(s_t, a_t) = s_{t+1}} F(s_t, a_t) - \mathbb{I}_{s_{t+1} = s_f}\, r(s_f, Y) - \mathbb{I}_{s_{t+1} \neq s_f} \sum_{a_{t+1} \in \mathcal{A}} F(s_{t+1}, a_{t+1}) \Big)^2,$$

where $\sum_{T(s_t, a_t) = s_{t+1}} F(s_t, a_t)$ denotes the inflows of a state $s_{t+1}$, $\sum_{a_{t+1} \in \mathcal{A}} F(s_{t+1}, a_{t+1})$ denotes the outflows of $s_{t+1}$, and $r(s_f, Y)$ denotes the reward of the final state, which is computed by equation 10. For interior states, we only calculate outflows based on action distributions. For final states, there are no outgoing flows and we only calculate their rewards. We summarize the algorithms for both the node classification task and the graph classification task in Appendix C.
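The loss above can be evaluated by hand on a toy flow table, as in the following Python sketch. The edge flows are stored in a dictionary rather than predicted by a policy network, and the states, flows and rewards are made-up values, so this only illustrates how the inflow, outflow and reward terms combine.

```python
# Hypothetical edge flows on a toy DAG; x and y are terminal states with rewards.
EDGE_FLOW = {("s0", "a"): 1.0, ("s0", "b"): 3.0, ("a", "x"): 1.0,
             ("b", "x"): 2.0, ("b", "y"): 1.0}
REWARD = {"x": 3.0, "y": 1.0}


def flow_matching_loss(trajectory, edge_flow, reward):
    """Sum over visited states of (inflow - target)^2.

    The target is the reward for a terminal state and the outflow otherwise.
    The initial state is skipped because it has no inflow.
    """
    loss = 0.0
    for state in trajectory[1:]:
        inflow = sum(f for (_, v), f in edge_flow.items() if v == state)
        if state in reward:
            target = reward[state]
        else:
            target = sum(f for (u, _), f in edge_flow.items() if u == state)
        loss += (inflow - target) ** 2
    return loss


consistent = flow_matching_loss(("s0", "b", "x"), EDGE_FLOW, REWARD)

# Perturb one edge flow so that the condition is violated at both b and x.
perturbed_flow = dict(EDGE_FLOW)
perturbed_flow[("b", "x")] = 1.0
violated = flow_matching_loss(("s0", "b", "x"), perturbed_flow, REWARD)
print(consistent, violated)
```

Note that the inflow of a state sums over all of its parents, not only the parent visited on the sampled trajectory, which is why the parent exploration of Section 3.5 is needed during training.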

4. EXPERIMENTS

In this section, we first introduce our experimental setup. Then we compare GFlowExplainer with several state-of-the-art baselines, GNNExplainer Ying et al. (2019), PGExplainer Luo et al. (2020), DEGREE Feng et al. (2021) and RGExplainer Shan et al. (2021), in both qualitative and quantitative evaluations. Further, we evaluate the performance of our approach in the inductive setting, together with ablation experiments, in Section 4.4 and Appendix D.

Published as a conference paper at ICLR 2023

4.1. EXPERIMENTAL SETUP

Datasets We use six datasets, of which four synthetic datasets (BA-Shapes, BA-Community, Tree-Cycles and Tree-Grid) are used for the node classification task and two datasets (BA-2motifs and Mutagenicity) are used for the graph classification task. These datasets are composed of motifs and bases. Motifs are small substructures in a graph, which have been shown to play a crucial role in predicting the label of node/graph instances. Bases are the remaining parts of a graph, which are randomly generated. Motifs are taken as the ground truth and the goal of explainers is to find them. Details of these datasets are described in Appendix E.3 and the visualizations are shown in Figure 9.

Model We use the trained GNN model in Holdijk et al. (2021), whose architecture is given in Luo et al. (2020); Ying et al. (2019). Specifically, for node classification, we use the model which consists of three consecutive graph convolution layers connected with a fully connected layer. For graph classification, the model includes three consecutive graph convolution layers fed into two pooling layers, one max and one mean. The output embeddings of the two pooling layers are then concatenated to generate the input for a fully connected layer.

Metrics The motifs in each dataset are the ground-truth explanations. The edges in the motif are positive and other edges are negative. GNNExplainer and PGExplainer return a mask matrix to represent the importance of each edge in the instance. RGExplainer and our method generate a subgraph. The explanation problem can be formalized as a binary classification task, where edges in the ground-truth motif are taken as prediction labels and the weights of edges are viewed as prediction scores. With the explanatory subgraph provided by explainers, the AUC score can be computed to measure the accuracy for quantitative evaluation.
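The edge-level AUC described above can be computed directly from its pairwise definition, as sketched below in plain Python. The edge list, motif, mask scores and the choice of scoring a generated subgraph with 1 for included edges and 0 otherwise are illustrative assumptions; the paper does not spell out how subgraph edges are weighted.

```python
def edge_auc(labels, scores):
    """Probability that a random positive edge outscores a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical instance: four edges, the first two form the ground-truth motif.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
motif = {(0, 1), (1, 2)}
labels = [int(e in motif) for e in edges]

# Mask-style explainer: one continuous importance score per edge.
mask_scores = [0.9, 0.8, 0.5, 0.1]
auc_mask = edge_auc(labels, mask_scores)

# Subgraph-style explainer: score 1 for edges inside the generated subgraph (assumption).
subgraph_edges = {(0, 1), (1, 2), (2, 3)}
auc_subgraph = edge_auc(labels, [float(e in subgraph_edges) for e in edges])
print(auc_mask, auc_subgraph)
```

With binary scores, every irrelevant edge included in the subgraph ties with the motif edges and lowers the AUC, so compact subgraphs that cover the motif score highest.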

4.2. QUALITATIVE ANALYSIS

We evaluate the single-instance explanations for the topology-based prediction task without node features in Figure 2, in which the green dots are our predicted nodes that belong to the motif, representing the critical nodes for GNN predictions. In contrast, the orange dots are predicted nodes that are not in the motif, i.e., nodes irrelevant to the GNN predictions. The pink dot is the node to be interpreted, which is also included in the subgraph. For a fair comparison, we choose the same node for each algorithm and output their generated subgraphs. As illustrated in the figure, the house, cycle, and tree motifs are identified by GFlowExplainer with relatively few irrelevant nodes and edges. However, on the BA-Community dataset, RGExplainer fails to find the motif. For graph classification, we visualize the explanation result for the BA-2motifs dataset, and both approaches find the five-node cycle motif for label 1.

We next show the quantitative results in Table 1. We run 10 different seeds for each approach and compute the average AUC scores and their standard deviations. From the table, we find that GFlowExplainer performs the best on five datasets. The difference between GFlowExplainer and the runner-up algorithm is not particularly noticeable on the BA-Shapes and MUTAG datasets. However, on the Tree-Cycles and BA-2motifs datasets, GFlowExplainer improves the performance and shows its superiority over other graph generation and perturbation approaches. Note that without pre-training, the AUC scores of RGExplainer on Tree-Cycles and Tree-Grid are always 0.5, while GFlowExplainer does not need any pre-training process and recovers the ground-truth motif better. Although GFlowExplainer is the runner-up on the BA-Community dataset, it is not far behind DEGREE.

4.4. INDUCTIVE SETTING WITH ABLATION EXPERIMENTS

To further show the effectiveness of our proposed theorem and the generalization ability of GFlowExplainer, we conduct ablation experiments with various cases and test the performance of GFlowExplainer in the inductive setting. We compare GFlowExplainer with GFlow-Sequence, a GFlowNets-based approach with the same state encoding, action space, reward function, and objective function. The difference is that the state is considered as a sequence, so that each state $s_t$ has only one parent state $s_{t-1} = s_t \setminus \{v_i\}$, where $a_{t-1} = \{v_i\}^+$, which is similar to RGExplainer. We also compare GFlowExplainer with RGExplainer-NoPretrain and RGExplainer. Specifically, we vary the training set size over {10%, 30%, 50%, 70%, 90%} and take the remaining instances for testing. For each dataset, we run the experiments 5 times and compute the average AUC scores. For fairness, we set the same parameters for each method. The comparison results are shown in Figure 3. For BA-Shapes and Tree-Cycles, since they already provide enough training samples for GFlowExplainer when the ratio is 10%, its performance is consistently good and stays within a certain interval. We also note that for some seeds, RGExplainer dropped sharply from an initial AUC of 0.77 to 0.5. We conjecture that the policy gradient and Monte Carlo estimation may suffer from high variance and be unstable, and thus may not be able to generate explanations consistently. In contrast, GFlowExplainer does not need any pre-training strategies and provides more consistent explanations. Finally, in Appendix B.1 we further discuss the properties of the DAG and why it is the critical ingredient for the best performance in our work.

5. CONCLUSION

In this work, we present GFlowExplainer to provide instance-level explanations for GNNs. The DAG structure in our method eliminates the influence of node sequence, and thus, without any pre-training strategies, we can provide faithful and consistent explanations while respecting the message passing nature of GNNs. We also propose a specific approach for checking cut vertices in dynamic graphs, thus accelerating the exploration of direct parents during training. Extensive experiments confirm the efficiency and strong generative ability of GFlowExplainer.

A PROOF OF MAIN RESULTS

A.1 PROOF OF LEMMA 1

Definition 6 A node $v_i$ is a cut vertex iff $\exists v_a, v_b \in G$ such that there is no node sequence $\vec{v} = (v_a, ..., v_b)$ along edges with $v_i \notin \vec{v}$, which means $v_a$ cannot reach $v_b$ without passing through $v_i$, and vice versa.

First, note that since $t > 1$, there are at least 2 nodes in $G_s(s_t)$. Suppose $v_i$ is not a cut vertex at $s_t$. Then, based on Definition 6, $\forall v_a, v_b \in G_s(s_t)$, $\exists \vec{v} = (v_a, ..., v_b)$ in which $v_i \notin \vec{v}$. Suppose $a_t = \{v_j\}^+$. Since no node is deleted, we have $\forall v_a, v_b \in G_s(s_{t+1}) \setminus \{v_i, v_j\}$, $\exists \vec{v} = (v_a, ..., v_b)$ in which $v_i \notin \vec{v}$, based on the connectivity of $G_s(s_t)$.

If $v_i$ becomes a cut vertex at $s_{t+1}$, then $\forall v_a \in G_s(s_{t+1})$ with $v_a \neq v_i$, $v_a \neq v_j$, $\nexists \vec{v} = (v_a, ..., v_j)$ where $v_i \notin \vec{v}$. Thus we have $\{v_i, v_j\} \in E$. If $|N(a_t)| > 1$, then $\exists v_k \in G_s(s_{t+1})$, $v_k \neq v_i$, $\{v_k, v_j\} \in E$, so there is a vertex sequence $\vec{v} = (v_a, ..., v_k, v_j)$ where $v_i \notin \vec{v}$, and $v_i$ is not a cut vertex based on Definition 6. Thus if $v_i$ becomes a cut vertex at $s_{t+1}$, we have $\{v_i, v_j\} \in E$ and $|N(a_t)| = 1$.

Conversely, if $\{v_i, v_j\} \in E$ and $|N(a_t)| = 1$, then $\forall v_a \in G_s(s_{t+1})$ with $v_a \neq v_i$, $v_a \neq v_j$, we have $\{v_a, v_j\} \notin E$. Therefore, there is no sequence $\vec{v} = (v_a, ..., v_j)$ without $v_i$, and based on Definition 6, $v_i$ is a cut vertex. This completes the proof that if $v_i$ is not a cut vertex at $s_t$, $v_i$ becomes a cut vertex from $s_{t+1}$ iff $|N(a_t)| = 1$ and $\{v_i, v_j\} \in E$, where $t > 1$. Based on this, we get the following Corollary 1.

Corollary 1 Suppose an action $a_t = \{v_i\}^+$ and there is no cut vertex at $s_t$. If $|N(a_t)| > 1$, then there is no cut vertex at $s_{t+1}$.

Next we prove that if $v_i$ is a cut vertex at $s_t$, $v_i$ is not a cut vertex from $s_{t+1}$ iff $|N(a_t)| > 1$ and $a_t$ connects to all "child groups" of $v_i$.

If $v_i$ is a cut vertex at $s_t$, it can be viewed as a parent node in a tree with some children. If $|N(a_t)| = 1$ and $\{v_i, v_j\} \in E$, then $v_j$ becomes a new child of $v_i$ in the tree, and thus there is a new "child group" of $v_i$. If instead an action $a_t = \{v_k\}^+$ connects all "child groups" of $v_i$, then these nodes can reach each other by passing through $v_k$ instead of $v_i$, and thus $v_i$ becomes a non-cut vertex from $s_{t+1}$.
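The statements of Lemma 1 and Corollary 1 can be checked by brute force on a small example. The sketch below is only an illustration and not part of the proposed method: the toy edge list and the helper names `is_connected` and `cut_vertices` are assumptions made for this example, and a node is tested as a cut vertex simply by removing it and checking connectivity.

```python
def is_connected(nodes, edges):
    """Return True if the subgraph induced on `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u:
                w = b
            elif b == u:
                w = a
            else:
                continue
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def cut_vertices(nodes, edges):
    """Brute force: v is a cut vertex if removing it disconnects the rest."""
    nodes = set(nodes)
    return {v for v in nodes if not is_connected(nodes - {v}, edges)}


# Hypothetical toy graph: triangle 0-1-2, node 3 attached only to 0,
# node 4 attached to 1 and 2, node 5 attached to 1 and 3.
EDGES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 4), (1, 5), (3, 5)]

s_t = {0, 1, 2}                        # no cut vertex in the triangle
one_neighbor = cut_vertices(s_t | {3}, EDGES)   # |N(a_t)| = 1, {v_0, v_3} in E
two_neighbors = cut_vertices(s_t | {4}, EDGES)  # |N(a_t)| = 2 (Corollary 1)
# Node 5 touches both "child groups" of the cut vertex 0, namely {1, 2} and {3}.
bridged = cut_vertices(s_t | {3, 5}, EDGES)
```

In this toy graph, adding node 3 (a single neighbor) turns node 0 into a cut vertex, adding node 4 (two neighbors) creates none, and adding node 5 afterwards removes the cut vertex again because it links both child groups of node 0.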

A.2 PROOF OF THEOREM 1

Next we prove Theorem 1 by mathematical induction.

1) We first prove that for $t = 2$, $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,j}(s_{t+1}) \neq 0$. Suppose $t = 2$; then $Z(s_t) = [0]_{2 \times 2}$. Since there are only two nodes $v_a, v_b$ in $G_s(s_t)$, without loss of generality we define $a_0 = \{v_a\}^+$ and $a_1 = \{v_b\}^+$. Suppose $a_t = \{v_i\}^+$.

If $|N(a_t)| = 1$, for example $\{v_a, v_i\} \in E$ (the case of $v_b$ is symmetric), then $v_a$ becomes a cut vertex according to Lemma 1. Based on equation 14 and equation 16 we have $z(a_t) = [1\ 0]^T$ and

$$Z'(s_t) = (1 - I_{2 \times 2}) \wedge \{Z(s_t) + z(a_t) \cdot [1]_{1 \times 2}\} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix},$$

$$z'(a_t) = Z'(s_t) \cdot z(a_t) + z(a_t) \wedge \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}.$$

Combining these two parts based on equation 15, we have

$$Z(s_{t+1}) = \begin{bmatrix} 0 & 1 & 2 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$

Since $Z_0(s_{t+1}) \neq [0]_{1 \times (t+1)}$, $v_a$ becomes a cut vertex, and $v_a$ has two "child groups".

Next we prove the converse by contradiction. Suppose $|N(a_t)| = 1$, $\{v_a, v_i\} \in E$ and $Z_0(s_{t+1}) = [0]_{1 \times (t+1)}$. Then based on the equations above, we have $Z'_0(s_t) = [0]_{1 \times 2}$ and $z'_0(a_t) = 0$.

Published as a conference paper at ICLR 2023

If $Z'_0(s_t) = [0]_{1 \times 2}$, since $(1 - I_{2 \times 2}) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and $Z(s_t) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, we have $z(a_t) \cdot [1]_{1 \times t} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, thus $z(a_t) = [0\ 0]^T$,

$Z'(s_t) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, $z'(a_t) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, where $\mathbb{I}_{set} = 0$ for both $Z_0(s_t)$ and $Z_1(s_t)$, since $\operatorname{set}(Z_0(s_t) \wedge [1\ 1]) = \operatorname{set}(Z_0(s_t)) = [0]$ and $\operatorname{set}(Z_1(s_t) \wedge [1\ 1]) = \operatorname{set}(Z_1(s_t)) = [0]$, and $\mathbb{I}_{sum} = 0$ since $Z'(s_t) z(a_t) = 0$. Combining these two parts based on equation 15, we have

$$Z(s_{t+1}) = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$

Next we prove the converse by contradiction. Suppose $|N(a_t)| = 2$, $\{v_a, v_i\} \in E$, $\{v_b, v_i\} \in E$ and $\exists Z_{0,j}(s_{t+1}) \neq 0$, $j = 0, 1, 2$. Then we have three cases:

• If $Z'_{0,0}(s_t) \neq 0$, this contradicts equation 16 since $(1 - I_{2 \times 2}) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$.

• If $Z'_{0,1}(s_t) \neq 0$, then we have $\mathbb{I}_{set} = 1$, which corresponds to $\operatorname{set}(Z_0(s_t) \wedge z^T(a_t)) \neq \operatorname{set}(Z_0(s_t))$. However, $\operatorname{set}(Z_0(s_t) \wedge z^T(a_t)) = \operatorname{set}(Z_0(s_t)) = [0]$ since $Z_0 = [0]_{1 \times 2}$, which contradicts the statement.

• If $z'_0(a_t) \neq 0$, then we have $\mathbb{I}_{sum} = 1$, which means $Z'_0(s_t) z(a_t) \neq 0$. Since $z(a_t) = [1\ 1]^T$, we should have $Z'_0(s_t) \neq [0]_{1 \times 2}$, which contradicts the cases above.

Thus we have proved that for $t = 2$, if $|N(a_t)| > 1$, then $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,j}(s_{t+1}) \neq 0$. Altogether, for $t = 2$, we have proved that $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,j}(s_{t+1}) \neq 0$.

2) Next we consider $t > 2$. Suppose that at $s_t$, $v_i$ is a cut vertex iff $\exists Z_{i,k}(s_t) \neq 0$, $k \neq i$. Suppose $a_t = \{v_j\}^+$. We need to prove that $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,k}(s_{t+1}) \neq 0$, $k \neq i$.

If $|N(a_t)| = 1$, without loss of generality we consider $\{v_i, v_j\} \in E$, $v_i \in G_s(s_t)$. Then based on equation 14, we have $z(a_t) = [0 \cdots 1 \cdots 0]^T$, where $z_i(a_t) = 1$ and $z_k(a_t) = 0$, $\forall k \neq i$. If $v_i$ is not a cut vertex, we have $Z_i(s_t) = [0]_{1 \times t}$. If $v_i$ is a cut vertex, we have $Z_{i,k}(s_t) \neq 0$, $\forall k \neq i$. Here $Z_i(s_t)$ is the row corresponding to $v_i$. Then based on equation 16, we have

$$Z'_i(s_t) = (1 - I_{t \times t})_i \wedge \{Z(s_t) + \mathbb{I}_{Z_i(s_t)[1 - z(a_t)] = 0}\, z(a_t) \cdot [1]_{1 \times t}\}_i.$$

We should check $\mathbb{I}_{Z_i(s_t)[1 - z(a_t)] = 0}$, and there are two different cases. Before giving the proof, we introduce Lemma 2 as follows.

Lemma 2 Suppose $a_t = \{v_j\}^+$, $|N(a_t)| = 1$, $v_i \in G_s(s_t)$. If $Z_i(s_t)[1 - z(a_t)] \neq 0$, $v_j$ connects to a cut vertex $v_i$, and if $Z_i(s_t)[1 - z(a_t)] = 0$, $v_j$ connects to a non-cut vertex $v_i$.

a) If $Z_i(s_t)[1 - z(a_t)] = 0$, then $v_j$ connects to a non-cut vertex based on Lemma 2. If $v_i$ is not a cut vertex at $s_t$, we have $Z_i(s_t) = [0]_{1 \times t}$ and

$$Z'_i(s_t) = [1 - I_{t \times t}]_i \wedge \{Z_i(s_t) + [z(a_t) \cdot [1]_{1 \times t}]_i\} = [1 \cdots 0 \cdots 1],$$

where $Z'_{i,i}(s_t) = 0$ and $Z'_{i,k}(s_t) = 1$, $\forall k \neq i$. And we have

$$z'_i(a_t) = [Z'(s_t) \cdot z(a_t)]_i + z_i(a_t) \wedge [\max\{Z'_i(s_t)\} + 1] = 2.$$

Combining these two parts based on equation 15, we have $Z_i(s_{t+1}) = [1 \cdots 0 \cdots 1\ 2]$, where $Z_{i,i}(s_{t+1}) = 0$, $Z_{i,k}(s_{t+1}) = 1$ for $k \neq i$, $k \leq t - 1$, and $Z_{i,t}(s_{t+1}) = 2$. Therefore we have

$\operatorname{set}(Z_i(s_t))$. If $z_k(a_t) = 1$, $k \neq i$, then $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) = [Z_{i,k}(s_t)] \neq \operatorname{set}(Z_i(s_t))$. Thus if $|N(a_t)| = 1$, we have $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) \neq \operatorname{set}(Z_i(s_t))$ and $v_i$ is still a cut vertex at $s_{t+1}$.

Based on Lemma 1 we know that only $|N(a_t)| = 1$ can introduce a new cut vertex, or add a new "child group" to a current cut vertex. If $a_t$ connects to all "child groups" of a cut vertex $v_i$, then $v_i$ becomes a non-cut vertex from $s_{t+1}$. Suppose $|N(a_t)| = m$, $m > 1$. Since $z(a_t)$ has $m$ non-zero positions, $Z_i(s_t) \wedge z(a_t)$ might have $0$, $m - 1$ or $m$ non-zero positions, as in the following cases:

• If $Z_i(s_t) \wedge z(a_t)$ is all zeros, $v_i$ is not a cut vertex.

• If $Z_i(s_t) \wedge z(a_t)$ has $m - 1$ non-zero positions, $v_i$ is a cut vertex and $\{v_i, v_j\} \in E$.

• If $Z_i(s_t) \wedge z(a_t)$ has $m$ non-zero positions, $v_i$ is a cut vertex and $\{v_i, v_j\} \notin E$.

Without loss of generality, we start with the case $Z_i(s_t) = [1 \cdots 0 \cdots 1\ 2]$, which means $a_{t-1} = \{v_a\}^+$ makes $v_i$ become a cut vertex at $s_t$. Thus $Z_{i,k}(s_t) = 1$ for $k < t$, $k \neq i$, $Z_{i,t}(s_t) = 2$, and $Z_{i,i}(s_t) = 0$. Then $v_i$ has two "child groups": the first group consists of all nodes in $G_s(s_t)$ except $v_i$, and the second group has only one node $v_a$. Suppose $a_t = \{v_j\}^+$ and $|N(a_t)| = m$.

• If $m = 2$, $\{v_a, v_j\} \in E$ and $\{v_i, v_j\} \in E$, then $z_i(a_t) = z_t(a_t) = 1$ and we have $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) = [2]$ while $\operatorname{set}(Z_i(s_t)) = [1, 2]$. Based on Lemma 1 we know $a_t$ does not connect the two "child groups" and thus $v_i$ is a cut vertex at $s_{t+1}$.

• If $m = 2$, $\{v_a, v_j\} \in E$ and $\{v_k, v_j\} \in E$ with $v_k \neq v_i$, then $z_k(a_t) = z_t(a_t) = 1$, $z_i(a_t) = 0$ and we have $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) = [1, 2] = \operatorname{set}(Z_i(s_t))$. Based on Lemma 1 we know $a_t$ connects the two "child groups" and thus $v_i$ is not a cut vertex at $s_{t+1}$.

• If $m > 2$ and $\{v_a, v_j\} \notin E$, we can easily get $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) = [1] \neq \operatorname{set}(Z_i(s_t))$. Based on Lemma 1 we know a t only connects v a and thus $v_i$ is a cut vertex at $s_{t+1}$.

Altogether, this completes the proof of Lemma 3.
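The arithmetic of the base case $t = 2$ with $|N(a_t)| = 1$ in the proof of Theorem 1 can be replayed numerically. The following is a minimal sketch under stated assumptions, not the full update rule: equations 14 to 16 are not reproduced here, so the indicator terms are left out, the operator $\wedge$ is read as an elementwise product with a 0/1 mask, and the way `Z_next` is assembled (new column $z'(a_t)$ plus a zero row for the new node) is inferred from the $3 \times 3$ matrix shown in the proof. The function name `base_case_update` is hypothetical.

```python
def base_case_update(Z, z):
    """Replay the |N(a_t)| = 1 arithmetic from the t = 2 base case.

    Z is the t x t cut vertex matrix Z(s_t) as nested lists and z is the
    0/1 connection vector z(a_t). Indicator terms of the general update
    are NOT modelled; this only mirrors the worked example in the proof.
    """
    t = len(Z)
    # Z'(s_t) = (1 - I) ^ (Z + z . [1]_{1 x t}), with ^ taken as a mask.
    Z_prime = [
        [(0 if r == c else 1) * (Z[r][c] + z[r]) for c in range(t)]
        for r in range(t)
    ]
    # z'(a_t) = Z' . z + z ^ (row-wise max of Z' + 1)
    z_prime = [
        sum(Z_prime[r][c] * z[c] for c in range(t)) + z[r] * (max(Z_prime[r]) + 1)
        for r in range(t)
    ]
    # Assumed assembly of Z(s_{t+1}): append z' as a column, add a zero row.
    Z_next = [Z_prime[r] + [z_prime[r]] for r in range(t)] + [[0] * (t + 1)]
    return Z_prime, z_prime, Z_next


Z_prime, z_prime, Z_next = base_case_update([[0, 0], [0, 0]], [1, 0])
# Z_prime == [[0, 1], [0, 0]], z_prime == [2, 0]
# Z_next  == [[0, 1, 2], [0, 0, 0], [0, 0, 0]]
```

Row 0 of `Z_next` is non-zero, matching the conclusion that $v_a$ becomes a cut vertex with two "child groups" (labels 1 and 2).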

B DISCUSSIONS B.1 PARENT EXPLORATIONS IN DAG MATTERS

Since the graph $G_s(s_t)$ is generated by sequential actions, the trajectory is an ordered node sequence. However, the generated subgraph should be an unordered set: it is independent of the sequence and determined only by the connectivity of the nodes. For example, for a graph consisting of three nodes with pair-wise edges between them, the generated graph is the same regardless of the order of nodes. However, if only two edges connect these three nodes, then the intermediate node, which acts as a bridge, cannot be the last one added. We conclude that the sequence matters when the order of adding nodes affects the connectivity of the graph. There may be many trajectories that lead to the same state $s_t$, and sampling a single trajectory $\tau$ each time does not capture this information. To solve this problem, RGExplainer Shan et al. (2021) applied pre-training strategies with maximum log-likelihood estimation (MLE) over all possible generation orderings of an explanatory graph Vinyals et al. (2015). In contrast, our GFlowExplainer is modeled on a directed acyclic graph structure: multiple action sequences lead to the same graph, and the exploration of direct parents for the flow matching conditions "connects" these trajectories together, which naturally eliminates the influence of orderings. Therefore, there is no need to pre-train, and we can learn a policy that generates good enough candidate graphs. We also conduct experiments to show the importance of connectivity constraints for loss convergence in Appendix D.2.
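The three-node example above can be made concrete. The sketch below is illustrative only: it enumerates the node orderings that keep every intermediate subgraph connected, and lists the direct parents of a state, that is, the states obtained by removing one node (other than the starting node) without disconnecting the subgraph. The helper names and the adjacency dictionaries `triangle` and `path` are assumptions made for this example.

```python
from itertools import permutations


def is_connected(nodes, adj):
    """True if the subgraph induced on `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] & nodes:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def valid_orderings(nodes, adj, start):
    """Orderings from `start` in which each new node touches the subgraph."""
    result = []
    for order in permutations(nodes):
        if order[0] != start:
            continue
        seen, ok = {start}, True
        for v in order[1:]:
            if not (adj[v] & seen):
                ok = False
                break
            seen.add(v)
        if ok:
            result.append(order)
    return result


def direct_parents(state, adj, start):
    """Connected states reachable by deleting one non-start node."""
    state = frozenset(state)
    return {
        state - {v}
        for v in state
        if v != start and is_connected(state - {v}, adj)
    }


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}  # pair-wise edges
path = {0: {1}, 1: {0, 2}, 2: {1}}            # node 1 is the bridge

orders_triangle = valid_orderings([0, 1, 2], triangle, 0)  # two orderings
orders_path = valid_orderings([0, 1, 2], path, 0)          # only (0, 1, 2)
parents_triangle = direct_parents({0, 1, 2}, triangle, 0)  # two parents
parents_path = direct_parents({0, 1, 2}, path, 0)          # one parent
```

With pair-wise edges, the state {0, 1, 2} has two direct parents, so two trajectories meet in the same DAG node; on the path graph the bridge node 1 can never be added last, which leaves a single ordering and a single parent.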

C ALGORITHMS

We show the pseudocode of our GFlowExplainer for node classification and graph classification in Algorithm 1 and Algorithm 2 respectively. For node classification tasks, given an input graph $G = (V, E)$ and its features $X$, a trained GNN model $\Phi$ and node instances $I$, GFlowExplainer aims to train a generative policy $\pi(a_t \mid s_t)$ and find the explanatory subgraph $G_s^{(i)}$ for the $i$-th node instance. The prediction of a node instance is determined by its $L$-hop neighborhood under the message passing scheme of GNNs, where $L$ is the number of layers in the trained model $\Phi$. During each training epoch, GFlowExplainer generates $s_f$ in parallel for each node $v_i \in I$ by sequential actions based on the policy $\pi(a_t \mid s_t)$. In every iteration, GFlowExplainer samples a valid action $a_t: \{v_j\}^+ \sim \pi(a \mid s_t)$ s.t. $v_j \in N(s_t)$ based on the generative flow network to make a state transition $s_t \to s_{t+1}$, and explores valid parents based on the updated cut vertex matrix $Z(s_{t+1})$. The terminal state $s_f$ is generated once the stopping criterion is reached. When the epoch number $E$ is reached, the policy $\pi(a_t \mid s_t)$ trained with the flow matching loss generates explanatory subgraphs at inference time for evaluation. For the graph classification task, the difference lies in the choice of the starting node. Therefore GFlowExplainer needs to train an additional locator $L$ during the training process. The final graph representations $z_{g_n}$ and node representations $z_{v_i,n}$ are computed based on the trained GNN model $\Phi$. Then $L$ is trained jointly with the policy $\pi$.

repeat (For each node $G^{(n)} \in I$, parallel do with a batch size $B$)

D ADDITIONAL RESULTS

Algorithm 1 GFlowExplainer for node classification

3: Initialize $s_0 = \{L(G^{(n)})\}$, $Z(s_0) = I_{2 \times 2}$
4: Construct $X'_t$,

D.1 EFFICIENCY ANALYSIS

We also compare the inference time of GNNExplainer, PGExplainer, RGExplainer, and our GFlowExplainer in the same environment. We compute the average inference time for explaining a single instance for each task and report the results in Table 2. GNNExplainer is the slowest, while the inference times of RGExplainer, PGExplainer and our GFlowExplainer are of the same order of magnitude. Therefore, we conclude that the GFlowNets-based framework does not require a longer inference time, since the sampling procedure for a connected subgraph is similar in RGExplainer and GFlowExplainer. We also report the training time of our GFlowExplainer and compare it with that of RGExplainer, including its pre-training part. The comparison results are shown in Table 3. As mentioned before, GFlowExplainer does not need a pre-training process, whereas the pre-training strategies of RGExplainer take considerable time and even dominate its total running time. In terms of the iterative update time per epoch, GFlowExplainer is overall faster than RGExplainer, which also shows the efficiency of the proposed Theorem 1 for updating cut vertices. This makes GFlowExplainer more practical than other learning-based approaches for large-scale datasets.

D.2 LOSS CONVERGENCE ANALYSIS

In this section we conduct more ablation experiments to show the role of connectivity constraints in the convergence of the flow loss. In the previous ablation experiments (refer to Section 4.4), we compared GFlow-Sequence and GFlowExplainer on explanation performance in the inductive setting. Here we consider a DAG structure without connectivity constraints, that is, each state $s_t$ has $(|G_s(s_t)| - 1)$ direct parents (because the node to be interpreted cannot be deleted, and $|G_s(s_t)|$ is the number of nodes in the subgraph). We call this approach GFlow-Graph. In theory, dropping the connectivity constraints while exploring parent states creates an inconsistency between the action space and the trajectories in the DAG structure. We visualize the flow matching loss of both GFlowExplainer and GFlow-Graph. We use 5 different seeds on the BA-Shapes dataset, with 16 batches and 80 epochs for each run. Figure 4 shows the flow matching loss of GFlowExplainer and GFlow-Graph. Since the original flow loss is small at the beginning of training, we scale up the coefficients of the regularization terms in the reward function to make the loss relatively high. As a result, both approaches converge, but GFlowExplainer attains lower losses, which confirms our statement. Furthermore, both approaches converge fast because once the starting point is fixed, the allowed action space shrinks significantly due to the connectivity constraints and the stopping criteria, which makes the flow calculation easier. We also plot the convergence analysis for other datasets in Figure 5.

D.3 COMPUTATIONAL ANALYSIS OF THEOREM 1

Tarjan's depth-first-search algorithm for finding cut vertices Tarjan (1972) needs to iterate over all nodes and edges of the subgraph $G_s(s_t)$. The time complexity is $O(|V| + |E|)$ for each state, which is inefficient for dynamic graphs. In addition, exploring all edges and nodes is costly in memory for a large graph and is not applicable in real-world applications. However, by taking advantage of the step-by-step generative process, we can update and store the cut vertices without repeatedly checking the whole subgraph. The idea is to exploit the properties of cut vertices in dynamic graphs and identify the conditions under which a vertex switches between cut and non-cut. To show the effectiveness and efficiency of the proposed Theorem 1, in this section we visualize the cut vertices in dynamic graphs via a simple simulation experiment. In addition, we compare the running time of our proposed algorithm with that of the traditional cut vertex algorithm. We construct a 10 × 10 adjacency matrix to represent an undirected connected graph; starting with two nodes, the subgraph adds one neighbor node at a time. We record the time of Tarjan's algorithm and of our approach. For a fair comparison, the time of the updating process in our method is also included. We report the accumulated time in Figure 6. As the size of the graph grows, the accumulated time of Tarjan's algorithm increases sharply while that of our approach increases linearly. We also visualize the cut vertex exploration process in dynamic graphs in Figure 7, in which the black dot represents the action of adding that node, pink dots correspond to cut vertices, and orange dots are regular nodes in the subgraph. Only the action node can introduce a new cut vertex or remove an existing one. Therefore, we only need to iterate over the new edges introduced by the action node and check its connectivity relationships with the other nodes in the subgraph.
In contrast, Tarjan's algorithm iterates over all nodes and edges in the subgraph after adding the action node; thus, it makes sense that our approach has a smaller time complexity.
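For reference, the full-recomputation baseline discussed in this section can be sketched as a textbook depth-first search over the whole subgraph. This is a generic articulation point routine written for illustration, not the authors' implementation or the matrix update of Theorem 1; the function name `articulation_points` and the adjacency dictionary format are assumptions. Each call visits every node and edge of the current subgraph, which is the $O(|V| + |E|)$ cost per state mentioned above.

```python
def articulation_points(adj):
    """Cut vertices of an undirected simple graph {node: set_of_neighbors}."""
    disc, low, cut = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for w in adj[u]:
            if w == parent:
                continue
            if w in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[w])
            else:
                children += 1
                dfs(w, u)
                low[u] = min(low[u], low[w])
                # A non-root u is a cut vertex if w's subtree cannot climb above u.
                if parent is not None and low[w] >= disc[u]:
                    cut.add(u)
        # The DFS root is a cut vertex iff it has more than one DFS child.
        if parent is None and children > 1:
            cut.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return cut


def induced(adj, nodes):
    """Adjacency of the subgraph induced on `nodes`."""
    nodes = set(nodes)
    return {u: adj[u] & nodes for u in nodes}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 on node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
# Recomputing from scratch after each added node, as the baseline does.
history = [articulation_points(induced(graph, [0, 1, 2][: k + 1])) for k in range(3)]
after_pendant = articulation_points(graph)
```

The recursion is fine for the small subgraphs used here; an iterative stack would be needed for very deep graphs.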

D.4 MORE INDUCTIVE SETTING RESULTS

Due to the space limit, we report additional inductive experiments in this section. Figure 8 shows the inductive experiments of four algorithms on the Tree-Grid and MUTAG datasets. For the reinforcement learning based approaches, the performance on Tree-Grid is similar to that on Tree-Cycles. Without pre-training strategies, the AUC of RG-NoPretrain remains at 0.5, while RGExplainer provides better explanations as the number of training instances increases. The ratio change of the

2019) is the first formal approach to explain trained GNNs, which defines the problem as an optimization task to maximize the mutual information between the predicted labels and the distribution of possible subgraphs with some constraints.

) is to set it to be uniform over all the valid parents of a state $s_{t+1}$, i.e., $P_B(\cdot \mid s_{t+1}) = 1 / \#\{s_t \mid (s_t \to s_{t+1}) \in \mathcal{A}\}$, as suggested in Malkin et al. (2022). In our case we only parameterize the former two terms. The original TB formulation parameterizes $Z_\theta$ with a constant since it considers the unconditional case. Therefore, it only needs to approximate the total flow $Z$ so that $Z = R(x), \forall x \in \mathcal{X}$. In contrast, our task applies state-conditional GFlowNets, which means there are various subgraph flows $Z_s$ that we need to approximate to get $Z_s = R(x \mid s)$. Simply speaking, $Z_s$ should be different for each $G_s$. To show the complexity of trajectory balance in state-conditional GFlowNets, we make two attempts. First, we follow the unconditional case and parameterize $Z_\theta$ with a constant for initialization, the same as in Malkin et al. (2022). Our objective is to learn the parameters $\theta$ of the forward conditional policies $P_F(s_{t+1} \mid s_t, v_0; \theta)$ and $\log Z_\theta$. Second, we consider the conditional flow approximation. Our objective is to learn the parameters $\theta$ of the forward conditional policies $P_F(s_{t+1} \mid s_t, v_0; \theta)$ and the function $\log Z_\theta(v_0)$.
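As a rough illustration of the quantities introduced in this paragraph, a trajectory balance residual in the style of Malkin et al. (2022) compares, in log space, the learned flow $\log Z_\theta(v_0)$ plus the forward log-probabilities along a trajectory with the log-reward plus the backward log-probabilities, and squares the difference. The sketch below assumes the uniform backward policy described above and plain Python floats in place of network outputs; the function names and the toy numbers are hypothetical.

```python
import math


def uniform_log_pb(num_parents_per_step):
    """log P_B for each step when P_B is uniform over the valid parents."""
    return [-math.log(n) for n in num_parents_per_step]


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward):
    """Squared log-ratio between forward flow and reward-weighted backward flow."""
    forward = log_z + sum(log_pf_steps)
    backward = math.log(reward) + sum(log_pb_steps)
    return (forward - backward) ** 2


# Toy trajectory with two transitions (all numbers are made up).
log_pf = [math.log(0.5), math.log(0.5)]   # forward policy probabilities
log_pb = uniform_log_pb([1, 2])           # 1 parent, then 2 valid parents
balanced = trajectory_balance_loss(math.log(2.0), log_pf, log_pb, reward=1.0)
unbalanced = trajectory_balance_loss(0.0, log_pf, log_pb, reward=1.0)
```

In the conditional variant, `log_z` would be the output of a learned function of the starting node $v_0$ rather than a single constant, which is the distinction the two attempts above are meant to probe.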
Therefore, $\forall \tau = (s_0, ..., s_{n+1} = s_f) \in \mathcal{T}$, we define the state-conditional trajectory balance as follows,

$$\mathcal{L}(\tau, v_0; \theta) = \left( \log \frac{Z_\theta(v_0) \prod_{s_t \to s_{t+1} \in \tau} P_F(s_{t+1} \mid s_t, v_0; \theta)}{r(s_f \mid v_0) \prod_{s_t \to s_{t+1} \in \tau} P_B(s_t \mid s_{t+1}, v_0)} \right)^2.$$

In our experiment, we train a three-layer MLP to model $Z_\theta(\cdot)$. The input is the node features of $v_0$ in each trajectory and the output is the approximated flow. The learning rate for $Z_\theta(\cdot)$ is 0.1 and the hidden layer size is 128. We plot the loss convergence of both approaches on the BA-Shapes dataset in Figure 10. The unconditional flow does not converge at all and its AUC stays at 0.5 ∼ 0.6; the conditional flow converges after some fluctuations, but its AUC has high variance, as shown in Table 6. We conjecture that the reason is that in our task, the loss decreases with changes of both $Z_\theta(\cdot)$ and $P_F(\cdot \mid \cdot, \cdot; \theta)$. Even though the loss converges, without a good approximation of $Z_\theta(\cdot)$ we cannot obtain the correct $P_F(\cdot \mid \cdot, \cdot; \theta)$. In graph-structured data, most nodes have the same features (especially in the synthetic datasets), so such conditional information does not distinguish them when they are fed into neural networks. For example, if $v_i$ and $v_j$ have the same features, the network outputs the same $Z_\theta(v_i) = Z_\theta(v_j)$, while with high probability $r(s_f \mid v_i) \neq r(s_f \mid v_j)$. Thus the approximation of the flow is biased, which further affects the approximation of $P_F(\cdot \mid \cdot, \cdot; \theta)$. However, using the flow matching loss, we have the following equation

$$\log \mathcal{L}(v_0) = \log \frac{\sum_{T(s_t, a_t) = s_{t+1}} F(s_t, a_t \mid v_0)}{r(s_f \mid v_0) + \sum_{a_{t+1} \in \mathcal{A}} F(s_{t+1}, a_{t+1} \mid v_0)}.$$

For future work, the neighbor nodes of each $v_0$ could also pass messages to it, so aggregating them with more complex graph neural networks instead of an MLP is more suitable in this study.

F.2 QUALITATIVE ANALYSIS

F.2.1 GRAPH-SST2 DATASET

We add a real dataset, Graph-SST2 Yuan et al. (2022), which is a sentiment graph dataset for graph classification. It contains 70042 graphs with 10 nodes per graph on average. Each graph is labeled by its sentiment, which is either positive or negative. The node embeddings are initialized as the pre-trained BERT word embeddings. We train a GCN classifier with an overall accuracy of 88.7%. Since the graph sizes differ and some graphs contain only a few words, we choose relatively large graphs to evaluate our explanations. Note that this dataset does not have ground-truth structures, so we visualize the subgraphs generated by GFlowExplainer for qualitative analysis. In Figure 11, each $s_f$ consists of the green nodes, which are identified as important for classification; the orange nodes are irrelevant. Both graphs are correctly classified as "negative". We find that ("be failure", "because doesn't know to have fun") and ("it's frustrating to see these guys", "waste their talents") explain these classification decisions.

F.2.2 MUTAG DATASET

We also visualize the results on the MUTAG dataset in Figure 12 to show that the subgraphs are more intuitive and human-intelligible. It is known that carbon rings and $NO_2$ or $NH_2$ groups tend to be mutagenic. Our GFlowExplainer identifies these connected important components with correct classification. In contrast, PGExplainer identifies discrete edges. In addition, GNNs use the message passing scheme to combine graph structures with node features. Our GFlowExplainer constructs connected graphs by adding nodes from the boundary of the current subgraph step by step, which is consistent with the message passing scheme and provides clearer explanations.

F.3 QUANTITATIVE COMPARISON

In this section, we compare GFlowExplainer with a Shapley-value based approach, SubgraphX Yuan et al. (2021), and with DEGREE on accuracy. We also report the fidelity and sparsity of our algorithm. Fidelity assesses how closely the explanations are related to the model's predictions: it computes the difference between predictions with and without the important structures. Sparsity measures the fraction of structures that are identified as important by explanation methods. Note that high Sparsity scores mean smaller structures are identified as important, which can affect the Fidelity scores since smaller structures (high Sparsity) tend to be less important (low Fidelity).

$$\text{Sparsity} = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{|M_i|}{|G_i^L|}\right)$$

Here $|M_i|$ denotes the number of important input features (nodes/edges/node features) identified, and $|G_i^L|$ is the total number of features in $G_i^L$, which refers to the $L$-hop graph. For GFlowExplainer, the masks can be directly determined by the obtained subgraphs.

$$\text{Fidelity} = \frac{1}{N} \sum_{i=1}^{N} \left( f(G_i^L)_{y_i} - f(\hat{G}_i^L)_{y_i} \right)$$

where $\hat{G}_i^L$ denotes $G_i^L$ without the identified important structures.

Suppose $k$ is the number of edges (nodes) inside the motifs of the synthetic datasets. We show the top-$k$ edges (nodes) ranked by their importance weights in our graph generation process. Based on the generation order of edges (nodes), we can assign different weights to them. In our GFlowExplainer, the weights of nodes correspond to their generation order, which means the edges (nodes) with larger weights are more likely to be added to the subgraph first. For the accuracy calculation, we follow a setting similar to SubgraphX and DEGREE for a fair comparison. We choose the first $k$ nodes in each generated subgraph, check whether they are in the motif, and show the results in Table 7. In our implementation, we found that the ground-truth indexes are inconsistent across the public GitHub repositories, which biases the accuracy calculation, so we fixed these inconsistencies ourselves.
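The sparsity and fidelity scores described in this section reduce to simple averages once the per-instance quantities are available. The sketch below is a minimal illustration: it assumes the mask sizes, the $L$-hop graph sizes, and the predicted probabilities of the true class (with and without the important structures) have already been computed elsewhere, and the function names and numbers are made up for the example.

```python
def sparsity(mask_sizes, graph_sizes):
    """Average of 1 - |M_i| / |G_i^L| over all explained instances."""
    n = len(mask_sizes)
    return sum(1 - m / g for m, g in zip(mask_sizes, graph_sizes)) / n


def fidelity(probs_full, probs_without_important):
    """Average drop in the true-class probability after removing the explanation."""
    n = len(probs_full)
    return sum(p - q for p, q in zip(probs_full, probs_without_important)) / n


# Two hypothetical instances: explanations keep 2 and 5 of 10 features.
example_sparsity = sparsity([2, 5], [10, 10])
# True-class probability on the full graph vs. with the explanation removed.
example_fidelity = fidelity([0.9, 0.8], [0.4, 0.6])
```

A smaller explanation raises sparsity, while fidelity is high only if removing the explanation actually changes the prediction, which is why the two scores need to be read together.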



Figure 1: Structure of GFlowExplainer: Sampling from a starting node $v_0$ (pink), for each state $s_t$, the combined features of the subgraph (pink and black) and neighbor nodes (orange) are fed into the policy network to sample an allowed action $a_t: \{v_2\}^+ \sim \pi(a \mid s_t)$ and obtain $s_{t+1}$. Then cut vertices are updated based on $s_{t+1}$ to find the valid parent set $S_p$ for calculating inflows $F(S_p \to s_{t+1})$. Either outflows $F(s_{t+1} \to S_c)$ or reward $r(s_f)$ is calculated based on the stopping criteria.

Figure 2: Qualitative Analysis for RGExplainer and GFlowExplainer

Figure 3: Comparison among GFlow-Sequence, GFlowExplainer, RG-NoPretrain and RGExplainer in the inductive setting for synthetic datasets. GFlowExplainer generalizes better.

Require: $G = (V, E)$: Graph; $X$: Node features; $I$: Node instances; $B$: batch size; $E$: epoch number; $\eta$: learning rate; $\Phi$: trained GNN classification model
1: repeat
2:   repeat (For each node $v_i \in I$, parallel do with a batch size $B$)
3:     Initialize $s_0 = \{v_i\}$, $Z(s_0) = I_{2 \times 2}$
4:     Construct $X'_t$, $H^L_t$ according to equation 7, equation 8 and equation 9
5:     Sample a valid action $a_t: \{v_j\}^+ \sim \pi(a \mid s_t)$ s.t. $v_j \in N(s_t)$
6:     Make a state transition $s_{t+1} = s_t \cup \{v_j\}$
7:     Update $Z_{t+1}$ according to equation 14
8:     Explore all valid parents with $(s_p, a_p)$ based on $Z_{t+1}$
9:   until Attain the stopping criteria
10:  Calculate $r(s_f, Y)$ based on equation 10
11:  Update the parameters $\{\Theta_1, \Theta_2, \theta_1\}$ based on $\nabla L(\tau)$ and $\eta$
12: until epoch number $E$ is reached
Ensure: Policy $\pi(a_t \mid s_t)$ and generated explanatory subgraph $G_s^{(i)}$ during the inference phase

Algorithm 2 GFlowExplainer for graph classification
Require: $G^{(n)} \in I$: Graph instances; $B$: batch size; $E$: epoch number; $\eta$: learning rate; $\Phi$: trained GNN classification model
1: repeat
2:
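The inner sampling loop of Algorithm 1 (initialize with the target node, repeatedly add a node from the boundary of the current subgraph, stop when a criterion is met) can be sketched as follows. This is a simplified stand-in and not the trained model: a uniform random choice replaces the learned policy $\pi(a \mid s_t)$, a fixed maximum size replaces the stopping criterion, and the cut vertex update and parent exploration steps are omitted. The function name `sample_subgraph` and the toy graph are assumptions.

```python
import random


def sample_subgraph(adj, start, max_size, rng):
    """Grow a connected node sequence from `start` by adding boundary nodes."""
    state = [start]
    members = {start}
    while len(state) < max_size:
        # Valid actions: neighbors of the current subgraph not yet included.
        boundary = sorted({w for u in members for w in adj[u]} - members)
        if not boundary:
            break
        v = rng.choice(boundary)  # placeholder for sampling from the policy
        state.append(v)
        members.add(v)
    return state


# Hypothetical toy graph given as an adjacency dictionary.
toy_graph = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1, 4}, 4: {3}}
trajectory = sample_subgraph(toy_graph, start=0, max_size=4, rng=random.Random(0))
```

Because every added node is adjacent to the current subgraph, each intermediate state stays connected, which is the property the connectivity-constrained action space relies on.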

Figure 4: Flow matching loss for GFlowExplainer and GFlow-Graph

Figure 6: Time Comparison of exploring cut vertices

Figure 8: Inductive setting with ablation experiments on Tree-Grid Dataset and MUTAG Dataset

Figure 10: GFlowNets using trajectory balance. Left figure shows the different flow loss with conditional Z θ (v 0 ) or unconditional Z θ .Right figure shows the explanation subgraph for one simple node with TB, which fails to find the motif. Without correct approximation to Z, GFlowExplainer could not sample s f so that P(s f ) ∝ r(s f ).

Figure 11: The subgraphs on the Graph-SST2 Dataset. The green nodes are identified as important nodes and the orange nodes are identified as irrelevant nodes.

Figure 12: Qualitative Comparison between PGExplainer and GFlowExplainer on MUTAG dataset.

Figure 13: Sparsity and Fidelity in BA-Shape dataset

Explanation AUC (Quantitative Evaluation)


which contradicts our statement that $z(a_t) = [1\ 0]^T$. Thus we have proved that for $t = 2$, if $|N(a_t)| = 1$, then $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,j}(s_{t+1}) \neq 0$.

If $|N(a_t)| > 1$, since there are only 2 nodes in $G_s(s_t)$, we have $|N(a_t)| = 2$, $\{v_a, v_i\} \in E$ and $\{v_b, v_i\} \in E$. Then based on Corollary 1, there are no cut vertices in $s_{t+1}$. Based on equation 14 and equation 16 we have $z(a_t) = [1\ 1]^T$ and

Inference Time (table columns: BA-Shapes, BA-Community, Tree-Cycles, Tree-Grid, BA-2motifs, MUTAG)

Codes are available at https://github.com/RexYing/gnn-model-explainer

Dataset statistics. Node Classification: BA-Shapes, BA-Community, Tree-Cycles, Tree-Grid. Graph Classification: BA-2motifs, MUTAG.

Comparisons among Flow Matching, Trajectory Balance (unconditional-Z) and Trajectory Balance (conditional-Z) Objectives on BA-Shape Dataset.



funding

and Huawei Noah's Ark Lab. This work was completed while Wenqian Li was a member of the Huawei Noah's Ark Lab for advanced study.

annex

$\exists Z_{i,k}(s_{t+1}) \neq 0$, $k = 0, ..., t$. Since $v_i$ is not a cut vertex, $|N(a_t)| = 1$ and $\{v_i, v_j\} \in E$, based on Lemma 1 we know $v_i$ is a cut vertex at $s_{t+1}$.

b) If $Z_i(s_t)[1 - z(a_t)] \neq 0$, there are some cut vertices at $s_t$, and $v_j$ connects to a cut vertex based on Lemma 2. In this case, $v_i$ is still a cut vertex at state $s_{t+1}$, since $\forall v_k \in G_s(s_{t+1})$ with $v_k \neq v_j$, $v_k \neq v_i$, $v_k$ cannot reach $v_j$ along edges without passing through $v_i$ because $|N(a_t)| = 1$, and vice versa. Without loss of generality, suppose

This assumption is the case that $a_{t-1} = \{v_a\}^+$, $|N(a_{t-1})| = 1$, and $v_i$ becomes a cut vertex at $s_t$. Then based on equation 16, we have

If $|N(a_t)| > 1$, without loss of generality, we can suppose $|N(a_t)| = k$, $k \geq 2$; then we have $k$ non-zero positions in $z(a_t)$. Before giving the proof, we propose Lemma 3 as follows,

We need to check whether $v_j$ connects to all "child groups" of $v_i$, then $v_i$ may not be a cut vertex after adding $v_j$. Then there are following two cases:

, then based on Lemma 3 we have $v_i$ becomes a non-cut vertex at $s_{t+1}$. Based on equation 17 we have

, then based on Lemma 3 we have $v_i$ is still a cut vertex at $s_{t+1}$. Based on equation 17 we have

Altogether, for $t > 2$, we have proved that $v_i$ is a cut vertex at $s_{t+1}$ iff $\exists Z_{i,j}(s_{t+1}) \neq 0$.

Therefore, we complete the proof that for $t \geq 2$, $v_i$ is a cut vertex at $s_t$ iff $\exists Z_{i,j}(s_t) \neq 0$.

If $v_i$ is a cut vertex, we have $Z_{i,i}(s_t) = 0$ and $Z_{i,k}(s_t) \neq 0$, where $k \neq i$. Then based on the precise matrix multiplication, we should have

Since $v_i$ is a cut vertex at $s_t$, $\operatorname{set}(Z_i(s_t))$ has at least 2 different values based on the calculations before, since we define $\operatorname{set}(\cdot)$ to contain the distinct values except 0.

If $|N(a_t)| = 1$, which means $z(a_t)$ has only one non-zero position, it is easy to see that $v_i$ is still a cut vertex at $s_{t+1}$.
If $z_i(a_t) = 1$, which means $\{v_j, v_i\} \in E$, then $\operatorname{set}(Z_i(s_t) \wedge z(a_t)) = \emptyset \neq$

2021) proposes a decomposition-based explanation method for graph neural networks, which directly decomposes the influence of node groups in the forward pass. The decomposition rules are designed for GCN and GAT. Further, to efficiently select subgraph groups from all possible combinations, the authors propose a greedy approach to search for maximally influential node sets. Code is available at https://github.com/Qizhang-Feng/DEGREE

• RGExplainer Shan et al. (2021) uses reinforcement learning to generate instance-level explanations for GNNs. A seed locator and stopping criteria are used to find the most influential node in a graph instance and to check whether the generated explanatory graph is good enough. Code is available at https://openreview.net/forum?id=nUtLCcV24hL

E.2 EXPERIMENT ENVIRONMENT

All experiments were conducted on an NVIDIA Quadro RTX 6000 with PyTorch. The parameters of GFlowExplainer are shown in Table 4.

E.3 DETAILS ABOUT DATASET

We show the data statistics in Table 5. In this paper we consider the following five datasets:

• The BA-Shapes dataset consists of one Barabási-Albert (BA) graph Barabási & Albert (1999) as the base and 80 house-structure motifs. Each motif is randomly attached to a node in the BA graph, and extra edges are added as noise;

• The BA-Community dataset comprises two BA-Shapes graphs with different node features generated by Gaussian distributions. Extra edges also connect the two BA-Shapes graphs;

• The Tree-Cycles dataset includes a multi-level binary tree as the base and 80 six-node cycle motifs. The cycle motifs are randomly attached to the tree;

• The BA-2motifs dataset has 1000 graphs, half of which are a BA graph attached with a house-structure motif, while the rest are a BA graph attached with a five-node cycle motif;

• The Mutagenicity dataset is a real dataset, which includes 4337 molecule graphs. They are classified as mutagenic or nonmutagenic depending on whether they contain $NH_2$ or $NO_2$ motifs.

