GRAPH INFORMATION BOTTLENECK FOR SUBGRAPH RECOGNITION

Abstract

Given a graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph should be as informative as possible, yet contain as little redundant and noisy structure as possible. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has rarely been studied for irregular graph data and graph neural networks (GNNs). In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning. Under this framework, one can recognize the maximally informative yet compressive subgraph, named the IB-subgraph. However, the GIB objective is notoriously hard to optimize, mostly due to the intractability of mutual information on irregular graph data and the unstable optimization process. To tackle these challenges, we propose: i) a GIB objective based on a mutual information estimator for irregular graph data; ii) a bi-level optimization scheme to maximize the GIB objective; iii) a connectivity loss to stabilize the optimization process. We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation and graph denoising. Extensive experiments demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties.

1. INTRODUCTION

Classifying the underlying labels or properties of graphs is a fundamental problem in deep graph learning with applications across many fields, such as biochemistry and social network analysis. However, real-world graphs are likely to contain redundant or even noisy information (Franceschi et al., 2019; Yu et al., 2019), which negatively impacts graph classification. This raises an interesting problem: recognizing an informative yet compressed subgraph from the original graph. For example, in drug discovery, when viewing molecules as graphs with atoms as nodes and chemical bonds as edges, biochemists are interested in identifying the subgraphs that mostly represent certain properties of the molecules, namely the functional groups (Jin et al., 2020b; Gilmer et al., 2017). In graph representation learning, the predictive subgraph highlights the vital substructure for graph classification, and provides an alternative way of yielding graph representations besides mean/sum aggregation (Kipf & Welling, 2017; Velickovic et al., 2017; Xu et al., 2019) and pooling aggregation (Ying et al., 2018; Lee et al., 2019; Bianchi et al., 2020). In graph attack and defense, it is vital to purify a perturbed graph and mine the robust structures for classification (Jin et al., 2020a). Recently, the mechanism of self-attentive aggregation (Li et al., 2019) can discover vital substructure at the node level given a well-selected threshold. However, this method only identifies isolated important nodes and ignores topological information at the subgraph level. This leads to a novel challenge, which we term subgraph recognition: How can we recognize a compressed subgraph with minimum information loss in terms of predicting the graph labels/properties?
Recalling the above challenge, there is a similar problem setting in information theory called the information bottleneck (IB) principle (Tishby et al., 1999), which aims to extract from the original data a compressed representation that keeps the most predictive information about labels or properties. Enhanced with deep learning, IB can learn informative representations from regular data in the fields of computer vision (Peng et al., 2019; Alemi et al., 2017; Luo et al., 2019), reinforcement learning (Goyal et al., 2019; Igl et al., 2019) and natural language processing (Wang et al., 2020). However, current IB methods, like VIB (Alemi et al., 2017), are still incapable of handling irregular graph data. It remains challenging for IB to compress irregular graph data, such as a subgraph of an original graph, with minimum information loss. Hence, we advance the IB principle to irregular graph data to resolve the proposed subgraph recognition problem, which leads to a novel principle, the Graph Information Bottleneck (GIB). Different from prior IB research that aims to learn an optimal representation of the input data in a hidden space, GIB directly reveals the vital substructure at the subgraph level. We first i) leverage the mutual information estimator from the Deep Variational Information Bottleneck (VIB) (Alemi et al., 2017) for irregular graph data as the GIB objective. However, the mutual information in VIB is intractable to compute without knowing the distribution forms, especially on graph data. To tackle this issue, ii) we adopt a bi-level optimization scheme to maximize the GIB objective. Meanwhile, the continuous relaxation that we adopt to approximate the discrete selection of the subgraph leads to an unstable optimization process. To further stabilize the training process and encourage a compact subgraph, iii) we propose a novel connectivity loss to assist GIB in effectively discovering the maximally informative yet compressed subgraph, which we define as the IB-subgraph.
By optimizing the above GIB objective and connectivity loss, one can recognize the IB-subgraph without any explicit subgraph annotation. On the other hand, iv) GIB is model-agnostic and can be easily plugged into various Graph Neural Networks (GNNs). We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation, and graph denoising. Extensive experiments on both synthetic and real-world datasets demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties compared to the subgraphs found by SOTA baselines.

2. RELATED WORK

Graph Classification. In recent literature, there is a surge of interest in adopting graph neural networks (GNNs) for graph classification. The core idea is to aggregate all the node information into a graph representation. A typical implementation is the mean/sum aggregation (Kipf & Welling, 2017; Xu et al., 2019), which averages or sums up the node embeddings. An alternative is to leverage the hierarchical structure of graphs, which leads to pooling aggregation (Ying et al., 2018; Zhang et al., 2018; Lee et al., 2019; Bianchi et al., 2020). When dealing with redundant and noisy graphs, these approaches are likely to result in sub-optimal graph representations. Recently, InfoGraph (Sun et al., 2019) maximizes the mutual information between graph representations and multi-level local representations to obtain more informative global representations.

Information Bottleneck. The information bottleneck (IB), originally proposed for signal processing, attempts to find a short code of the input signal that preserves maximum information about the relevance variable (Tishby et al., 1999). Alemi et al. (2017) first bridged the gap between IB and deep learning, proposing the variational information bottleneck (VIB). Nowadays, IB and VIB have been widely employed in computer vision (Peng et al., 2019; Luo et al., 2019), reinforcement learning (Goyal et al., 2019; Igl et al., 2019), natural language processing (Wang et al., 2020) and speech and acoustics (Qian et al., 2020) due to their capability of learning compact and meaningful representations. However, IB remains less explored for irregular graphs due to the intractability of mutual information.

Subgraph Discovery. Traditional subgraph discovery includes dense subgraph discovery and frequent subgraph mining. Dense subgraph discovery aims to find the subgraph with the highest density (e.g. the number of edges over the number of nodes (Fang et al., 2019; Gionis & Tsourakakis, 2015)).
Frequent subgraph mining looks for the most common substructures among graphs (Yan & Han, 2002; Ketkar et al., 2005; Zaki, 2005). At the node level, researchers discover vital substructure via the attention mechanism (Velickovic et al., 2017; Lee et al., 2019; Knyazev et al., 2019). Ying et al. (2019) further identify the important computational graph for node classification. Alsentzer et al. (2020) discover subgraph representations with specific topology given subgraph-level annotation. Recently, it has become popular to select a neighborhood subgraph of a central node for message passing in node representation learning. DropEdge (Rong et al., 2020) relieves the over-smoothing phenomenon in deep GCNs by randomly dropping a portion of edges in the graph. Similar to DropEdge, the DropNode (Chen et al., 2018; Hamilton et al., 2017; Huang et al., 2018) principle is also widely adopted in node representation learning. FastGCN (Chen et al., 2018) and ASGCN (Huang et al., 2018) accelerate GCN training via node sampling. GraphSAGE (Hamilton et al., 2017) leverages neighborhood sampling for inductive node representation learning. NeuralSparse (Zheng et al., 2020) selects the top-K (K is a hyper-parameter) task-relevant 1-hop neighbors of a central node for robust node classification.

3. NOTATIONS AND PRELIMINARIES

Let {(G_1, Y_1), ..., (G_N, Y_N)} be a set of N graphs with their real-valued properties or categories, where G_n refers to the n-th graph and Y_n to the corresponding property or label. We denote by G_n = (V, E, A, X) the n-th graph of size M_n, with node set V = {V_i | i = 1, ..., M_n}, edge set E = {(V_i, V_j) | i > j; V_i and V_j are connected}, adjacency matrix A ∈ {0, 1}^(M_n × M_n), and feature matrix X ∈ R^(M_n × d) of V with d-dimensional features. Denote the neighborhood of V_i as N(V_i) = {V_j | (V_i, V_j) ∈ E}. We use G_sub to denote a specific subgraph and Ḡ_sub the complementary structure of G_sub in G. Let f : 𝔾 → R / {0, 1, ..., n} be the mapping from graphs to the real-valued property or category Y, where 𝔾 is the domain of input graphs. I(X, Y) refers to the Shannon mutual information of two random variables.

3.1. GRAPH CONVOLUTIONAL NETWORK

Graph convolutional networks (GCNs) are widely adopted for graph classification. Given a graph G = (V, E) with node feature matrix X and adjacency matrix A, a GCN outputs the node embeddings X' via the following process: X' = GCN(A, X; W) = ReLU(D̂^(-1/2) Â D̂^(-1/2) X W), where Â = A + I is the adjacency matrix with self-loops, D̂ is the diagonal degree matrix of Â, and W refers to the model parameters. One can simply sum up the node embeddings to obtain a fixed-length graph embedding (Xu et al., 2019). Recently, researchers have attempted to exploit the hierarchical structure of graphs, which leads to various graph pooling methods (Li et al., 2019; Gao & Ji, 2019; Lee et al., 2019; Diehl, 2019; Zhang et al., 2018; Ranjan et al., 2020; Ying et al., 2018). Li et al. (2019) enhance graph pooling with a self-attention mechanism to leverage the importance of different nodes to the result. The graph embedding is then obtained by multiplying the node embeddings with the normalized attention scores: E = Att(X') = softmax(Φ_2 tanh(Φ_1 X'^T)) X', where Φ_1 and Φ_2 refer to the parameters of the self-attention module.
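The propagation rule above can be sketched in a few lines of numpy (a minimal dense illustration; real implementations such as PyTorch Geometric handle sparsity, batching and learned weights):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees of A_hat
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)     # ReLU

# Toy example: a 3-node path graph with 2-d features and an identity weight.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.eye(2)
H = gcn_layer(A, X, W)
graph_embedding = H.sum(axis=0)                # sum aggregation (Xu et al., 2019)
```

Summing the rows of `H` gives the fixed-length graph embedding used by sum aggregation; pooling methods instead weight the rows before aggregating.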

3.2. OPTIMIZING INFORMATION BOTTLENECK OBJECTIVE

Given the input signal X and the label Y, the IB objective is maximized to find the internal code Z: max_Z I(Z, Y) − β I(X, Z), where β is a hyper-parameter trading off informativeness and compression. Optimizing this objective leads to a compact yet informative Z. Alemi et al. (2017) optimize a tractable lower bound of the IB objective:

L_VIB = (1/N) Σ_{i=1}^{N} [ ∫ p(z|x_i) log q_φ(y_i|z) dz − β KL(p(z|x_i) || r(z)) ], (3)

where q_φ(y_i|z) is the variational approximation to p(y_i|z) and r(z) is the prior distribution of Z. However, it is hard to estimate mutual information in high-dimensional spaces when the distribution forms are inaccessible, especially for irregular graph data.

Figure 1: Illustration of the proposed graph information bottleneck (GIB) framework. We employ a bi-level optimization scheme to optimize the GIB objective and thus yield the IB-subgraph. In the inner optimization phase, we estimate I(G, G_sub) by optimizing the statistics network of the Donsker-Varadhan representation (Donsker & Varadhan, 1983). Given a good estimation of I(G, G_sub), in the outer optimization phase, we maximize the GIB objective by optimizing the mutual information, the classification loss L_cls and the connectivity loss L_con.
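As a concrete illustration of the VIB bound, the sketch below evaluates it for a diagonal-Gaussian encoder p(z|x) = N(μ(x), diag(σ²(x))) with a standard-normal prior r(z), using the closed-form Gaussian KL and reparameterized samples for the expectation. All names here (`log_q`, the toy μ and log-variance) are illustrative placeholders, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss_vs_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vib_bound(mu, log_var, log_lik_fn, beta, n_samples=64):
    """Monte-Carlo estimate of E_{z~p(z|x)}[log q(y|z)] - beta * KL."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(0.5 * log_var) * eps       # reparameterization trick
    expected_ll = np.mean([log_lik_fn(zi) for zi in z])
    return expected_ll - beta * kl_diag_gauss_vs_std_normal(mu, log_var)

# Toy 2-d encoder output and a toy decoder likelihood peaked at z = (1, 1).
mu, log_var = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
log_q = lambda z: -0.5 * np.sum((z - np.array([1.0, 1.0]))**2)
bound = vib_bound(mu, log_var, log_q, beta=0.1)
```

The KL term is exactly why VIB does not transfer to subgraphs: it requires a tractable prior r(z), which has no reasonable analogue for a distribution over graph substructures.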

4. OPTIMIZING THE GRAPH INFORMATION BOTTLENECK OBJECTIVE FOR SUBGRAPH RECOGNITION

In this section, we elaborate on the proposed method in detail. We first formally define the graph information bottleneck and the IB-subgraph. Then, we introduce a novel framework for GIB to effectively find the IB-subgraph. We further propose a bi-level optimization scheme and a graph mutual information estimator for GIB optimization. Moreover, we apply a continuous relaxation to the generation of the subgraph, and propose a novel loss to stabilize the training process.

4.1. GRAPH INFORMATION BOTTLENECK

We generalize the information bottleneck principle to learn an informative representation of irregular graphs, which leads to the graph information bottleneck (GIB) principle.

Definition 4.1 (Graph Information Bottleneck). Given a graph G and its label Y, GIB seeks the most informative yet compressed representation Z by optimizing the following objective:

max_Z I(Y, Z) s.t. I(G, Z) ≤ I_c, (4)

where I_c is the information constraint between G and Z. By introducing a Lagrange multiplier β to Eq. 4, we reach its unconstrained form:

max_Z I(Y, Z) − β I(G, Z). (5)

Eq. 5 gives a general formulation of GIB. Here, in subgraph recognition, we focus on a subgraph which is compressed with minimum information loss in terms of graph properties.

Definition 4.2 (IB-subgraph). For a graph G, its maximally informative yet compressed subgraph, namely the IB-subgraph, can be obtained by optimizing the following objective:

max_{G_sub ∈ 𝔾_sub} I(Y, G_sub) − β I(G, G_sub), (6)

where 𝔾_sub denotes the set of all subgraphs of G. The IB-subgraph enjoys various pleasant properties and can be applied to multiple graph learning tasks such as improvement of graph classification, graph interpretation, and graph denoising. However, the GIB objective in Eq. 6 is notoriously hard to optimize due to the intractability of mutual information and the discrete nature of irregular graph data. We next introduce approaches to optimize this objective and derive the IB-subgraph.

4.2. BI-LEVEL OPTIMIZATION FOR THE GIB OBJECTIVE

The GIB objective in Eq. 6 consists of two terms. We first examine I(Y, G_sub), which measures the relevance between G_sub and Y. We expand I(Y, G_sub) as:

I(Y, G_sub) = ∫ p(y, G_sub) log p(y|G_sub) dy dG_sub + H(Y). (7)

H(Y) is the entropy of Y and thus can be ignored during optimization. In practice, we approximate p(y, G_sub) with the empirical distribution p(y, G_sub) ≈ (1/N) Σ_{i=1}^{N} δ_y(y_i) δ_{G_sub}(G_sub_i), where δ(·) is the Dirac function used to sample training data, and G_sub_i and y_i are the output subgraph and graph label of the i-th training example. By substituting the true posterior p(y|G_sub) with a variational approximation q_φ1(y|G_sub), we obtain a tractable lower bound of the first term in Eq. 6:

I(Y, G_sub) ≥ ∫ p(y, G_sub) log q_φ1(y|G_sub) dy dG_sub ≈ (1/N) Σ_{i=1}^{N} log q_φ1(y_i|G_sub_i) =: −L_cls(q_φ1(y|G_sub), y_gt), (8)

where y_gt is the ground-truth label of the graph. Eq. 8 indicates that maximizing I(Y, G_sub) amounts to minimizing the classification loss L_cls between the prediction from G_sub and Y. Intuitively, minimizing L_cls encourages the subgraph to be predictive of the graph label. In practice, we choose the cross-entropy loss for categorical Y and the mean squared loss for continuous Y. For more details on deriving Eq. 7 and Eq. 8, please refer to Appendix A.1.

We then consider the minimization of I(G, G_sub), the second term of Eq. 6. Recall that Alemi et al. (2017) introduce a tractable prior distribution r(Z) in Eq. 3, which results in a variational upper bound. However, this is troublesome in our setting, as it is hard to find a reasonable prior for p(G_sub), a distribution over graph substructures rather than latent representations. Thus we take another route.
Directly applying the Donsker-Varadhan representation (Donsker & Varadhan, 1983) of the KL-divergence, we have:

I(G, G_sub) = sup_{f_φ2 : 𝔾×𝔾 → R} E_{(G, G_sub) ~ p(G, G_sub)} [f_φ2(G, G_sub)] − log E_{G ~ p(G), G_sub ~ p(G_sub)} [e^{f_φ2(G, G_sub)}], (9)

where f_φ2 is the statistics network that maps pairs of graphs to real numbers. To approximate I(G, G_sub) using Eq. 9, we design a statistics network based on modern GNN architectures, as shown in Figure 1: we first use a GNN to extract embeddings from both G and G_sub (with parameters shared with the subgraph generator, elaborated in Section 4.3), then concatenate the two embeddings and feed them into an MLP, which finally produces the real number. In conjunction with the sampling method to approximate p(G, G_sub), p(G) and p(G_sub), we reach the following optimization problem to approximate I(G, G_sub):

max_φ2 L_MI(φ2, G_sub) = (1/N) Σ_{i=1}^{N} f_φ2(G_i, G_sub_i) − log (1/N) Σ_{i=1, j≠i}^{N} e^{f_φ2(G_i, G_sub_j)}. (10)

With this approximation to the MI on graph data, we combine Eq. 6, Eq. 8 and Eq. 10 and formulate the optimization of GIB as a tractable bi-level problem:

min_{G_sub, φ1} L(G_sub, φ1, φ2*) = L_cls(q_φ1(y|G_sub), y_gt) + β L_MI(φ2*, G_sub) (11)
s.t. φ2* = arg max_φ2 L_MI(φ2, G_sub). (12)

We first derive a sub-optimal φ2, denoted φ2*, by optimizing Eq. 12 for T steps in the inner loop. After the T-step inner optimization ends, Eq. 10 serves as a proxy for MI minimization in the GIB objective in the outer loop. Then, the parameters φ1 and the subgraph G_sub are optimized to yield the IB-subgraph. However, in the outer loop, the discrete nature of G and G_sub hinders applying gradient-based methods to optimize the bi-level objective and find the IB-subgraph.
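The Donsker-Varadhan estimate in Eq. 10 can be illustrated on toy scalar data, replacing graph embeddings with numbers and the learned statistics network f_φ2 with a simple fixed critic (a hypothetical stand-in for the GNN+MLP network; joint pairs keep the pairing (x_i, y_i), marginal pairs shuffle y):

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_mi_lower_bound(x, y, critic):
    """Donsker-Varadhan bound: E_joint[f] - log E_marginal[e^f]."""
    joint = np.mean([critic(xi, yi) for xi, yi in zip(x, y)])
    y_shuf = y[rng.permutation(len(y))]        # break the pairing -> marginal
    marg = np.log(np.mean([np.exp(critic(xi, yj))
                           for xi, yj in zip(x, y_shuf)]))
    return joint - marg

# Correlated data: y = x + small noise, so I(X, Y) > 0.
x = rng.standard_normal(2000)
y = x + 0.1 * rng.standard_normal(2000)
critic = lambda a, b: 0.5 * a * b              # fixed critic (no inner loop here)
mi_est = dv_mi_lower_bound(x, y, critic)
```

In the actual method the critic is a trained network and the inner loop of Eq. 12 ascends this bound in φ2 before each outer update; here the critic is fixed purely to show the bound's shape.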

4.3. THE SUBGRAPH GENERATOR AND CONNECTIVITY LOSS

To alleviate the discreteness in Eq. 11, we propose a continuous relaxation of subgraph recognition and a loss to stabilize the training process.

Subgraph generator: For the input graph G, we generate its IB-subgraph from the node assignment S, which indicates whether each node belongs to G_sub or Ḡ_sub. We introduce a continuous relaxation of the node assignment using the probability of each node belonging to G_sub or Ḡ_sub; that is, the i-th row of S is the 2-dimensional vector [p(V_i ∈ G_sub | V_i), p(V_i ∈ Ḡ_sub | V_i)]. We first use an l-layer GNN to obtain node embeddings and then employ a multi-layer perceptron (MLP) to output S:

X_l = GNN(A, X_{l−1}; θ_1), S = Softmax(MLP(X_l; θ_2)),

where S is an n × 2 matrix and n is the number of nodes. The row-wise Softmax on the output of the MLP ensures each node is assigned either in or out of the subgraph. For simplicity, we compile the above modules into the subgraph generator, denoted g(·; θ) with θ := (θ_1, θ_2). When S is well learned, the node assignments are supposed to saturate towards 0/1. The representation of G_sub, which is used to predict the graph label, is obtained by taking the first row of S^T X_l.

Connectivity loss: However, poor initialization can cause p(V_i ∈ G_sub | V_i) and p(V_i ∈ Ḡ_sub | V_i) to be close. This either leads the model to assign all nodes to G_sub or Ḡ_sub, or results in the representation of G_sub containing much information from redundant nodes. Both scenarios make the training process unstable. Moreover, since S outputs the subgraph at the node level, we want the model to have an inductive bias that better leverages topological information. We therefore propose the following connectivity loss:

L_con = ||Norm(S^T A S) − I_2||_F,

where Norm(·) is the row-wise normalization, ||·||_F is the Frobenius norm, and I_2 is the 2×2 identity matrix.
L_con not only leads to a distinguishable node assignment, but also encourages the subgraph to be compact. Take the first row of S^T A S for example, and denote by a_11 and a_12 its (1,1) and (1,2) elements:

a_11 = Σ_{i,j} A_ij p(V_i ∈ G_sub | V_i) p(V_j ∈ G_sub | V_j), a_12 = Σ_{i,j} A_ij p(V_i ∈ G_sub | V_i) p(V_j ∈ Ḡ_sub | V_j). (15)

Minimizing L_con drives a_11/(a_11 + a_12) → 1, which means that if V_i is in G_sub, the nodes in N(V_i) have a high probability of also being in G_sub. Minimizing L_con also drives a_12/(a_11 + a_12) → 0, which encourages p(V_i ∈ G_sub | V_i) → 0/1 and fewer cuts between G_sub and Ḡ_sub. The same analysis holds for Ḡ_sub via a_21 and a_22. In a word, L_con encourages a distinctive S that stabilizes the training process and a compact topology in the subgraph. The overall objective is therefore:

min_{θ, φ1} L(θ, φ1, φ2*) = L_cls(q_φ1(g(G; θ)), y_gt) + α L_con(g(G; θ)) + β L_MI(φ2*, G_sub)
s.t. φ2* = arg max_φ2 L_MI(φ2, G_sub).

We provide pseudo code in the Appendix to better illustrate how to optimize the above objective.
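The behavior of L_con above can be checked numerically. In the sketch below (a minimal numpy illustration, not the paper's implementation), a crisp assignment that splits two triangles along their single bridging edge scores much lower than a fuzzy 50/50 assignment:

```python
import numpy as np

def connectivity_loss(S, A):
    """L_con = || Norm(S^T A S) - I_2 ||_F with row-wise normalization."""
    M = S.T @ A @ S                              # 2x2 within/cross "cut" matrix
    M_norm = M / M.sum(axis=1, keepdims=True)    # row-wise normalization
    return np.linalg.norm(M_norm - np.eye(2))    # Frobenius norm by default

# Two triangles (nodes 0-2 and 3-5) joined by the single edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Crisp assignment along the natural cut vs. an uninformative 50/50 one.
S_crisp = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
S_fuzzy = np.full((6, 2), 0.5)

assert connectivity_loss(S_crisp, A) < connectivity_loss(S_fuzzy, A)
```

Driving the normalized matrix toward I_2 simultaneously rewards saturated assignments and penalizes edges cut between G_sub and its complement, which is exactly the compactness bias discussed above.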

5. EXPERIMENTS

In this section, we evaluate the proposed GIB method on three scenarios, including improvement of graph classification, graph interpretation and graph denoising.

5.1. BASELINES AND SETTINGS

Improvement of graph classification: For the improvement of graph classification, GIB generates the graph representation by aggregating the subgraph information. We plug GIB into various backbones, including GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2017), GIN (Xu et al., 2019) and GraphSAGE (Hamilton et al., 2017). We compare the proposed method with the mean/sum aggregation (Kipf & Welling, 2017; Velickovic et al., 2017; Hamilton et al., 2017; Xu et al., 2019) and pooling aggregation (Zhang et al., 2018; Ranjan et al., 2020; Ying et al., 2018; Diehl, 2019) in terms of classification accuracy. Moreover, we apply DropEdge (Rong et al., 2020) to GAT, namely GAT+DropEdge, which randomly drops 30% of the edges during message passing at the node level. Similarly, we apply GIB to GAT+DropEdge, resulting in GAT+GIB+DropEdge. For fair comparisons, all the backbones for the different methods consist of the same 2-layer GNN with hidden size 16.

Graph interpretation:

The goal of graph interpretation is to find the substructure that shares the most similar property with the molecule. If the substructure is disconnected, we evaluate its largest connected part. We compare GIB with the attention mechanism (Li et al., 2019); that is, we attentively aggregate the node information for graph prediction, and generate the interpretable subgraph by choosing the nodes with the top 50% and 70% attention scores, namely Att05 and Att07. GIB outputs the interpretation as the IB-subgraph. We then evaluate the absolute property bias (the absolute value of the difference between the property of the graph and that of the subgraph) between the graph and its interpretation. Again, for fair comparisons, all the backbones for the different methods consist of the same 2-layer GNN with hidden size 16.

Graph denoising: We translate the permuted graph into its line-graph and use GIB and attention to 1) infer the real structure of the graph, and 2) classify the permuted graph via the inferred structure. We further compare the performance of GCN and DiffPool on the permuted graphs.

Datasets for graph interpretation: We construct the datasets for graph interpretation on four molecular properties based on the ZINC dataset, which contains 250K molecules. QED measures the drug-likeness of a molecule and is bounded within (0, 1.0). DRD2 measures the probability that a molecule is active against the dopamine type 2 receptor and is bounded within (0, 1.0). HLM-CLint and MLM-CLint are estimated values of in vitro human and mouse liver microsome metabolic stability (base-10 logarithm of mL/min/g). We sample molecules with QED ≥ 0.85, DRD2 ≥ 0.50, HLM-CLint ≥ 2 and MLM-CLint ≥ 2 for each task. We use 85% of these molecules for training, 5% for validation and 10% for testing. The statistics of the datasets are available in Table 8 of the Appendix.

Datasets for graph denoising: We generate a synthetic dataset by adding 30% redundant edges to each graph in the MUTAG dataset.
We use 70% of these graphs for training, 5% for validating and 25% for testing.

Improvement of Graph Classification:

In Table 1, we comprehensively evaluate the proposed method and the baselines on the improvement of graph classification. We train GIB on various backbones and aggregate the graph representations only from the subgraphs, comparing the performance of our framework with the mean/sum aggregation and pooling aggregation. The results show that GIB improves graph classification by reducing the redundancies in the graph structure.

Graph interpretation: Qualitative results are provided in the Appendix. In Table 5, we compare the average number of disconnected substructures per graph, since a compact subgraph should preserve more topological information. GIB generates more compact subgraphs that better interpret the graph property. Moreover, compared to the baselines, GIB does not require a hyper-parameter to control the size of subgraphs, and is thus more adaptive to different tasks. Please refer to Table 9 and Table 10 of the Appendix for details. The training dynamics are shown in Fig. 7. We provide results with other MI estimators in Table 11 of the Appendix.

Graph denoising: Table 4 shows the performance of different methods on noisy graph classification. GIB outperforms the baselines on classification accuracy by a large margin due to the superior properties of the IB-subgraph. Moreover, GIB better reveals the real structure of permuted graphs in terms of precision and recall of true edges.

5.4. ABLATION STUDY

To further understand the roles of L_con and L_MI, we derive two variants of our method by deleting L_con and L_MI, namely GIB w/o L_con and GIB w/o L_MI. Note that GIB w/o L_MI is similar to InfoGraph (Sun et al., 2019) and GNNExplainer (Ying et al., 2019), as they only maximize the MI between the latent embedding and the global summarization and ignore compression. When adapted to subgraph recognition, such a variant is likely to output G_sub = G. We evaluate the variants with a 2-layer GCN with hidden size 16 on graph interpretation. In practice, we find that the training process of GIB w/o L_con is unstable, as discussed in Section 4.3. Moreover, we find that GIB w/o L_MI is very likely to output G_sub = G, as it does not consider compression. We therefore try several initializations for GIB w/o L_con and GIB w/o L_MI to obtain the reported results. As shown in Table 3, GIB outperforms both variants, indicating that every part of our model contributes to the improvement in performance.

5.5. MORE DISCUSSION ON CONNECTIVITY LOSS

L_con is proposed to stabilize the training process and produce compact subgraphs. As it imposes a regularization on subgraph generation, we are interested in its potential influence on the size of the chosen IB-subgraph. We therefore study how the hyper-parameter of L_con affects the size of the chosen IB-subgraph. We run experiments with α varying in {1, 3, 5, 10} on the QED dataset and compute the mean and standard deviation of the sizes of the IB-subgraphs (All) and of their largest connected parts (Max). As shown in Table 6, different values of α result in similar IB-subgraph sizes; hence the influence of α on the size of the chosen subgraphs is weak.

A APPENDIX

A.1 MORE DETAILS ABOUT EQ. 7 AND EQ. 8

Here we provide more details on how to obtain Eq. 7 and Eq. 8. Expanding the mutual information,

I(Y, G_sub) = ∫ p(y, G_sub) log [p(y|G_sub) / p(y)] dy dG_sub = ∫ p(y, G_sub) log p(y|G_sub) dy dG_sub + H(Y),

which gives Eq. 7. Since the KL-divergence is non-negative, substituting the variational approximation q_φ1(y|G_sub) for the true posterior p(y|G_sub) yields a lower bound:

I(Y, G_sub) ≥ ∫ p(y, G_sub) log q_φ1(y|G_sub) dy dG_sub ≈ (1/N) Σ_{i=1}^{N} log q_φ1(y_i|G_sub_i) = −L_cls(q_φ1(y|G_sub), y_gt),

which gives Eq. 8.

A.2 CASE STUDY

To understand the bi-level objective for MI minimization in Eq. 11, we provide a case study in which we optimize the parameters of a distribution to reduce the mutual information between two variables. Consider X = sign(Z) with Z ~ N(0, 1), and p(y|x) = N(y; x, σ²). The distribution of Y is:

p(y) = ∫ p(y|x) p(x) dx = p(y|x = 1) p(x = 1) + p(y|x = −1) p(x = −1) = 0.5 (N(y; 1, σ²) + N(y; −1, σ²)).

We optimize the parameter σ² to reduce the mutual information between X and Y. In each epoch, we sample 20000 data points from the joint distribution, denoted X = {x_1, x_2, ..., x_20000} and Y = {y_1, y_2, ..., y_20000}. The number of inner steps is set to 150. After the inner optimization ends, the model yields a good mutual information approximator and optimizes σ² to reduce the mutual information by minimizing L_MI. We compute the mutual information with the traditional method and compare it with L_MI:

I(X, Y) = ∫∫ p(x, y) log [p(y|x) / p(y)] dx dy ≈ (1/20000) Σ_{i=1}^{20000} log [p(y_i|x_i) / p(y_i)].

As shown in Fig. 3, the mutual information decreases as L_MI descends. The advantage of this bi-level objective for MI minimization in Eq. 11 is that it only requires samples instead of the distribution forms; moreover, it needs no tractable prior distribution for variational approximation. The drawback is that it requires additional computation in the inner loop.
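The "traditional" Monte-Carlo estimate used as the reference in this case study can be reproduced in a few lines of numpy (a sketch of the sanity check only, not the learned L_MI estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def toy_mi(sigma2, n=20000):
    """Monte-Carlo estimate of I(X, Y) for x = sign(N(0,1)), y|x ~ N(x, sigma2)."""
    x = np.sign(rng.standard_normal(n))
    y = x + np.sqrt(sigma2) * rng.standard_normal(n)
    p_y_given_x = normal_pdf(y, x, sigma2)
    p_y = 0.5 * (normal_pdf(y, 1.0, sigma2) + normal_pdf(y, -1.0, sigma2))
    return np.mean(np.log(p_y_given_x / p_y))

# Larger sigma2 blurs the two modes of p(y) together, so the MI shrinks,
# mirroring what minimizing L_MI achieves by increasing the noise.
mi_small_noise = toy_mi(0.1)
mi_large_noise = toy_mi(4.0)
assert mi_small_noise > mi_large_noise
```

Since X is binary, I(X, Y) is upper-bounded by H(X) = log 2 ≈ 0.693 nats, which the small-noise estimate approaches.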

A.3 ALGORITHM

The algorithm is shown as follows:

1: function GIB(G, Y)
2:   θ ← θ^0, φ_1 ← φ_1^0
3:   for i = 0 → N do
4:     φ_2 ← φ_2^0
5:     for t = 0 → T do
6:       φ_2^{t+1} ← φ_2^t + η_1 ∇_{φ_2^t} L_MI
7:     end for
8:     θ^{i+1} ← θ^i − η_2 ∇_{θ^i} L(θ^i, φ_1^i, φ_2^T)
9:     φ_1^{i+1} ← φ_1^i − η_2 ∇_{φ_1^i} L(θ^i, φ_1^i, φ_2^T)
10:  end for
11:  G_sub ← g(G; θ^N)
12:  return G_sub
13: end function
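In Python, the bi-level loop has the following shape (a schematic skeleton with hypothetical gradient helpers `grad_mi` and `grad_outer` standing in for automatic differentiation; it mirrors the update order of the pseudocode, not the paper's actual implementation):

```python
import numpy as np

def bilevel_train(theta, phi1, phi2_init, grad_mi, grad_outer,
                  n_outer=50, n_inner=150, eta1=0.1, eta2=0.01):
    """Bi-level optimization skeleton for the GIB objective.
    Inner loop: gradient ascent on L_MI in phi2 (MI estimator).
    Outer loop: gradient descent on L in theta (generator) and phi1 (classifier)."""
    for _ in range(n_outer):
        phi2 = phi2_init.copy()              # re-initialize statistics network
        for _ in range(n_inner):             # inner loop: tighten the MI estimate
            phi2 = phi2 + eta1 * grad_mi(phi2, theta)
        g_theta, g_phi1 = grad_outer(theta, phi1, phi2)
        theta = theta - eta2 * g_theta       # outer loop: update generator ...
        phi1 = phi1 - eta2 * g_phi1          # ... and variational classifier
    return theta, phi1

# Toy instantiation: L_MI(phi2; theta) = -(phi2 - theta)^2 (maximized at
# phi2 = theta), outer loss = theta^2 + phi1^2 (minimized at the origin).
grad_mi = lambda phi2, theta: -2.0 * (phi2 - theta)
def grad_outer(theta, phi1, phi2):
    return 2.0 * theta, 2.0 * phi1

theta, phi1 = bilevel_train(np.array(3.0), np.array(2.0), np.array(0.0),
                            grad_mi, grad_outer)
```

Re-initializing `phi2` at each outer step matches line 4 of the pseudocode; in the toy run, both outer parameters shrink toward their optimum.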

A.4 MORE RESULTS ON GRAPH INTERPRETATION

In Fig. 4 , we show the distribution of absolute bias between the property of graphs and subgraphs. GIB is able to generate such subgraphs with more similar properties to the original graphs. In Fig. 5 , we provide more results of four properties on graph interpretation.

A.5 MORE RESULTS ON NOISY GRAPH CLASSIFICATION

We provide qualitative results on noisy graph classification in Fig. 6 .

A.6 DETAILS OF DATASETS

We provide the statistics of the datasets used in the experiments. For graph classification, we evaluate the proposed method on four datasets: MUTAG, PROTEINS, IMDB-BINARY and DD. In the graph interpretation task, the hyper-parameter of L_con, α, is set to 5 on all four datasets. We show the mean and standard deviation of the sizes of subgraphs, in percent, in Table 9 and Table 10. Note that the sizes of the chosen subgraphs mainly depend on task-relevant information. For example, as DRD2 measures the probability of being active against the dopamine type 2 receptor, it depends on almost the whole structure of a molecule. In contrast, HLM-CLint measures in vitro human liver microsome metabolic stability, which is greatly influenced by small motifs. As shown in Table 9 and Table 10, GCN+GIB can recognize subgraphs of adaptive sizes on different tasks, leading to better performance. In GCN+Att05 and GCN+Att07, however, the size of the subgraph is explicitly controlled by the hyper-parameter (preserving the top 50% or 70% of nodes with the highest attention scores), which limits the performance of these methods.



6. CONCLUSION

In this paper, we have studied the subgraph recognition problem of inferring a maximally informative yet compressed subgraph. We define such a subgraph as the IB-subgraph and propose the graph information bottleneck (GIB) framework for effectively discovering an IB-subgraph. We derive the GIB objective from a mutual information estimator for irregular graph data, which is optimized by a bi-level learning scheme. A connectivity loss is further used to stabilize the learning process. We evaluate our GIB framework on the improvement of graph classification, graph interpretation and graph denoising. Experimental results verify the superior properties of IB-subgraphs.

Footnotes:
1. Notice that the MINE estimator (Belghazi et al., 2018) straightforwardly uses the Donsker-Varadhan representation to derive an MI estimator between regular input data and its vectorized representation/encoding. It cannot be applied to estimate the mutual information between G and G_sub, since both G and G_sub are irregular graph data.
2. We follow the protocol in https://github.com/rusty1s/pytorch_geometric/tree/master/benchmark/kernel
3. We use the toy dataset from https://github.com/mzgubic/MINE
4. The statistics of the datasets in graph classification are collected from http://networkrepository.com



Figure 2: The molecules with their interpretable subgraphs discovered by different methods. These subgraphs exhibit similar chemical properties compared to the molecules on the left.

5.2 DATASETS

Improvement of graph classification: We evaluate different methods on the MUTAG (Rupp et al., 2012), PROTEINS (Borgwardt et al., 2005), IMDB-BINARY and DD (Rossi & Ahmed, 2015) datasets. The statistics of the datasets are available in Table 7 of the Appendix.

Figure 3: We use the bi-level objective to minimize the mutual information between two distributions. The MI is consistent with the loss as L_MI declines.

Figure 4: The histogram of the absolute bias between the properties of graphs and subgraphs.

Figure 5: The molecules with their interpretations found by GIB. These subgraphs exhibit chemical properties similar to those of the molecules on the left.

Figure 6: We show the blind denoising results on permuted graphs. Each method operates on the line graphs and tries to recover the true topology by removing redundant edges. Columns 4, 5 and 6 show the results obtained by different methods, where "miss: m, wrong: n" means m edges are missing and n edges are wrong in the output graph. GIB always recognizes structure more similar to the ground truth (not provided during training) than the other methods.

Classification accuracy. The pooling methods yield pooling aggregation while the backbones yield mean aggregation. The proposed GIB method with backbones yields the subgraph embedding by aggregating the nodes in the subgraph.

The mean and standard deviation of absolute property bias between the graphs and the corresponding subgraphs.

Ablation study on L_con and L_MI. Note that we try several initializations for GIB w/o L_con and L_MI to obtain the current results, due to the instability of the optimization process.

Quantitative results on graph denoising. We report the classification accuracy (Acc), the number of recovered real edges over the total number of real edges (Recall), and the number of real edges over the total number of edges in the subgraph (Precision) on the test set.
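The edge-level Recall and Precision reported here can be computed as follows, with each graph represented as a set of edges (node pairs). This is a small illustrative sketch; the function name and representation are our own.

```python
def edge_precision_recall(pred_edges, true_edges):
    """Edge-level metrics for graph denoising.

    Precision: recovered real edges / edges in the predicted subgraph.
    Recall:    recovered real edges / total real edges.
    """
    pred, true = set(pred_edges), set(true_edges)
    hit = len(pred & true)
    precision = hit / len(pred) if pred else 0.0
    recall = hit / len(true) if true else 0.0
    return precision, recall
```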



The influence of the hyper-parameter α of L_con on the size of subgraphs.

$$
\begin{aligned}
I(Y; G_{\mathrm{sub}}) &= \int p(y, G_{\mathrm{sub}}) \log p(y \mid G_{\mathrm{sub}})\, \mathrm{d}y\, \mathrm{d}G_{\mathrm{sub}} - \int p(y, G_{\mathrm{sub}}) \log p(y)\, \mathrm{d}y\, \mathrm{d}G_{\mathrm{sub}} \\
&= \int p(y, G_{\mathrm{sub}}) \log p(y \mid G_{\mathrm{sub}})\, \mathrm{d}y\, \mathrm{d}G_{\mathrm{sub}} + H(Y) \\
&= \int p(y, G_{\mathrm{sub}}) \log q_{\phi_1}(y \mid G_{\mathrm{sub}})\, \mathrm{d}y\, \mathrm{d}G_{\mathrm{sub}} + \mathbb{E}_{G_{\mathrm{sub}}}\!\left[\mathrm{KL}\!\left(p(y \mid G_{\mathrm{sub}}) \,\|\, q_{\phi_1}(y \mid G_{\mathrm{sub}})\right)\right] + H(Y) \\
&\geq \int p(y, G_{\mathrm{sub}}) \log q_{\phi_1}(y \mid G_{\mathrm{sub}})\, \mathrm{d}y\, \mathrm{d}G_{\mathrm{sub}} + H(Y)
\end{aligned}
$$
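Since H(Y) is a constant of the data, maximizing this lower bound amounts to minimizing the cross-entropy between the graph labels and the predictions of the variational distribution q_{φ1}(y | G_sub). A minimal Monte-Carlo sketch of that surrogate (names illustrative, not the paper's code):

```python
import math

def neg_log_likelihood(q_probs, labels):
    """Monte-Carlo estimate of -E[log q(y | G_sub)]: the cross-entropy
    surrogate whose minimization maximizes the variational lower bound
    on I(Y; G_sub).

    q_probs: per-graph predicted class distributions (lists of probabilities).
    labels:  integer class label per graph.
    """
    return -sum(math.log(p[y]) for p, y in zip(q_probs, labels)) / len(labels)
```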

Statistics of datasets in improvement of graph classification.

ACKNOWLEDGEMENTS

This work is partially funded by Beijing Natural Science Foundation (Grant No. JQ18017) and Youth Innovation Promotion Association CAS (Grant No. Y201929).

A.9 IMPLEMENTATION WITH OTHER MUTUAL INFORMATION ESTIMATORS

As shown in (Sun et al., 2019; Nowozin et al., 2016), the f-divergence family can also approximate the mutual information. Here we provide the results of GCN+GIB with the Jensen-Shannon divergence (JSD) and the χ² divergence on graph classification in Table 11. The experimental results show that our model can also employ other mutual information estimators for bi-level optimization.

As the initialization of our model may influence the finally chosen subgraphs, we rerun our model five times on the QED dataset for the graph interpretation task. We then employ the intersection over union (IoU) to measure the overlap between the subgraphs chosen in the 5 different runs and the results reported in Table 2. Similarly, we compute the IoU between the chosen subgraphs and their largest connected parts separately, which we refer to as IoU_all and IoU_max. We finally report the mean and standard deviation of IoU_all and IoU_max on the test set in Table 12. We notice that different initializations have limited influence on the chosen subgraphs, as the results of all five additional runs share a high portion of common nodes with the initial run.

Published as a conference paper at ICLR 2021
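The IoU overlap measure used in the stability analysis above can be computed over node-index sets as follows (an illustrative sketch; the function name and the empty-set convention are our own):

```python
def subgraph_iou(nodes_a, nodes_b):
    """Intersection over union between two chosen subgraphs, as node sets.

    Returns a value in [0, 1]; 1.0 means the two runs selected the same
    nodes. Two empty selections are treated as a perfect match.
    """
    a, b = set(nodes_a), set(nodes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Averaging this quantity over pairs of runs gives the mean and standard deviation reported in Table 12.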

