SEARCHING LOTTERY TICKETS IN GRAPH NEURAL NETWORKS: A DUAL PERSPECTIVE

Abstract

Graph Neural Networks (GNNs) have shown great promise in various graph learning tasks. However, the computational overhead of fitting GNNs to large-scale graphs grows rapidly, which hinders GNNs from scaling up to real-world applications. To tackle this issue, the Graph Lottery Ticket (GLT) hypothesis articulates that there always exists a sparse subnetwork/subgraph with admirable performance in GNNs with random initialization. Such a pair of core subgraph and sparse subnetwork (called a graph lottery ticket) can be uncovered by iteratively applying a novel sparsification method. While GLT provides new insights for GNN compression, it requires a full pretraining process to obtain graph lottery tickets, which is neither universal nor friendly to real-world applications. Moreover, the graph sparsification in GLT utilizes sampling techniques, which may result in massive information loss and aggregation failure. In this paper, we explore the search for graph lottery tickets from a complementary perspective: transforming a random ticket into a graph lottery ticket, which allows us to more comprehensively explore the relationships between the original network/graph and their sparse counterparts. Compared to GLT, our proposal helps achieve a triple-win situation of graph lottery tickets with high sparsity, admirable performance, and good explainability. More importantly, we rigorously prove that our model can eliminate noise and maintain reliable information in substructures using the graph information bottleneck theory. Extensive experimental results on various graph-related tasks validate the effectiveness of our framework.

1. INTRODUCTION

Graph Neural Networks (GNNs) Kipf & Welling (2016); Hamilton et al. (2017) have recently emerged as the dominant model for a diversity of graph learning tasks, such as node classification Velickovic et al. (2017), link prediction Zhang & Chen (2019), and graph classification Ying et al. (2018). The success of GNNs mainly derives from a recursive neighborhood aggregation scheme, i.e., message passing, in which each node updates its feature by aggregating and transforming the features of its neighbors. However, GNNs suffer from notoriously high computational overheads when scaling up to large graphs or graphs with dense connections, since conducting message passing over large or dense graphs is costly for both training and inference Xu et al. (2018); You et al. (2020).

To alleviate such inefficiency, existing approaches mostly fall into two research lines: they either simplify the graph structure or compress the GNN model. Within the first class, many studies Chen et al. (2018); Eden et al. (2018); Calandriello et al. (2018) have investigated the use of sampling to reduce the computational footprint of GNNs. These sampling-based strategies are usually integrated with a mini-batch training schedule for local feature aggregation and updating. Another representative is graph sparsification techniques Voudigari et al. (2016); Zheng et al. (2020); Li et al. (2020b), which improve the training or inference efficiency of GNNs by learning to remove redundant edges from input graphs. In contrast to simplifying the graph structure, there are far fewer prior studies on pruning or compressing GNNs Tailor et al. (2020), as GNNs are generally less parameterized than DNNs in other fields, e.g., computer vision Wen et al. (2016); He et al. (2017).

Further, the Graph Lottery Ticket hypothesis (GLT) Chen et al. (2021) kills two birds with one stone: for the first time, it simultaneously simplifies the input graph and prunes the GNN without compromising model performance. The key insight is to generalize the Lottery Ticket Hypothesis (LTH) Frankle & Carbin (2018) to GNNs. Recalling that LTH articulates that there always exist sparse and high-performance subnetworks in a dense network with random initialization (like winning tickets in a lottery pool), GLT delineates a graph lottery ticket as a combination of a core subgraph and a sparse subnetwork with admirable performance. More specifically, GLT first devises a unified GNN sparsification (UGS) strategy for jointly pruning the graph adjacency matrix as well as the network weights, and then iteratively applies UGS to uncover the winning tickets in GNNs. Extensive experiments on GNN benchmarks have verified the effectiveness of GLT across various architectures, learning tasks, and initialization schemes.

Figure 1: ... GLT Chen et al. (2021). The red star (★) denotes a node that connects two important communities. After adopting GLT, it can be seen that the edge connecting the two communities is discarded. Consequently, ★ can no longer aggregate information from both sub-structures.

After revisiting the theory of GLT, we expose two crucial factors that may impede GLT in practice. Firstly, GLT takes a whole pretraining process to obtain a sparse subnetwork, which limits its applicability to real-world usage and complicates the investigation of the relationships between the original network (or graph) and their sparse counterparts. Secondly, the sampling-based graph simplification in GLT may lead to two devastating challenges: a) Information loss: pruning subgraph edges as GLT does may cause massive information loss, resulting in performance collapse Wu et al. (2022a).
b) Aggregation failure: as the sparsity increases, some "unimportant" edges may be discarded by the pruning algorithm, even though they sometimes connect two very important local communities (see Fig. 1).

In this paper, we investigate a more universal yet challenging problem from a perspective complementary to GLT: how can we transform a randomly selected ticket (i.e., a pair of graph and network) into a graph lottery ticket in GNNs? Compared to the magnitude-based network pruning and graph sparsification in GLT, such a transformation process enables us to more comprehensively explore the relationships between the original network/graph and their sparse counterparts. We make two-fold efforts to answer the above question: regularization-based network pruning and hierarchical graph sparsification.

Primarily, we present the first attempt to generalize the Dual Lottery Ticket Hypothesis (DLTH) Bai et al. (2022) to GNN pruning. Initially designed for pruning deep neural networks, DLTH utilizes a Gradually Increased Regularization (GIR) term Wang et al. (2020a) to transfer the model expressivity from the discarded part to the remaining part. When adapting it to GNNs, we first randomly select a target sparse subnetwork within the original dense network, and then attach GIR to the remaining parameters (those outside the target subnetwork) to stimulate magnitude discrepancy among the parameters. In other words, as the regularization penalty factor increases, information is continuously extruded from the remaining parameters into our target subnetwork. Once the magnitude discrepancy among parameters is large enough, we remove the penalized parameters to obtain the final sparse network.

However, GIR is not directly applicable to graph sparsification when generalizing DLTH for finding graph lottery tickets. To this end, we propose Hierarchical Graph Sparsification (HGS), which is not only compatible with the GIR-based pruning strategy but also mitigates the information loss and aggregation failure issues in GLT.
HGS learns a differentiable soft assignment matrix for nodes at each GNN layer, projecting nodes to a set of clusters which is then utilized as the coarsened input for the next GNN layer. Hierarchical representations of graphs are produced accordingly. Finally, we take the element-wise product of the adjacency matrix of the coarsened graph at each GNN layer with a trainable mask for graph sparsification. In this way, useful information is extruded into the anticipated structure, thereby avoiding massive information loss. Since there are no node or edge dropping operations in our method, HGS can naturally remedy the aggregation failure in GNNs as well.

We unify the above regularization-based pruning and hierarchical graph sparsification into a single framework for transforming a randomly selected ticket into a graph lottery ticket in GNNs, leading to a dual perspective of GLT. We therefore name our framework Dual Graph Lottery Tickets (DGLT). Similar to GLT, DGLT is model-agnostic, makes no assumptions on the graph structure, and can be easily applied and scaled up to a variety of graph-based learning tasks. To enhance its explainability, we theoretically justify our information extrusion approach through the popular Graph Information Bottleneck (GIB) theory. Our contributions are summarized as follows:

• We explore a new and non-trivial problem of transforming a random ticket into a graph lottery ticket in GNNs. Compared to GLT, which pretrains dense GNNs to recognize graph lottery tickets, transforming a random ticket into a pair of high-performance sparse network and core subgraph is more appealing and valuable in practical usage, and allows us to investigate the relationships between the original network/graph and their sparse counterparts in a principled way.

• We present the Dual Graph Lottery Ticket (DGLT) framework to transform a random ticket into a triple-win graph lottery ticket, i.e., one with high sparsity, high performance, and good explainability. DGLT prunes the GNN architecture by GIR-based information extrusion and sparsifies the input graph in a hierarchical manner to overcome the information loss and aggregation failure in GLT. Moreover, the graph information bottleneck theory is utilized to theoretically support the algorithm's effectiveness.

2. PRELIMINARY & RELATED WORK

Graph Neural Networks. Given an undirected graph $G = \{V, E\}$ with a node set $V$ and an edge set $E$, GNNs aim to learn a representation vector of a node or an entire graph based on the adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ and node features $X \in \mathbb{R}^{|V| \times F}$. Modern GNNs mostly follow a message passing strategy, in which we iteratively update the representation of a node $v_i \in V$ by aggregating and transforming the representations of its neighbors. For example, Kipf & Welling (2016) propose a two-layer GNN with learnable parameters $\Theta = \{\Theta^{(0)}, \Theta^{(1)}\}$ for node classification as:

$$Z = S\left(\hat{A}\, \sigma\left(\hat{A} X \Theta^{(0)}\right) \Theta^{(1)}\right), \qquad L(G, \Theta) = -\sum_{v_i \in V_l} y_i \log(z_i),$$

where $Z$ is the prediction result, $S(\cdot)$ is the softmax function, $\sigma(\cdot)$ denotes an activation function, $\hat{A} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-loops, and $D$ is the degree matrix of $A + I$. To optimize such GNNs, we minimize the cross-entropy loss $L(G, \Theta)$ over all labelled nodes $V_l \subset V$, where $y_i$ and $z_i$ represent the label and prediction of node $v_i$, respectively. More message passing schemes have been investigated in Hamilton et al. (2017); Velickovic et al. (2017); Li et al. (2020a).

Despite the promising results obtained by GNNs, they encounter notorious inefficiency when scaling up to large or dense graphs. Many streams of work have been dedicated to solving this issue. Graph sampling or sparsification accelerates the representation learning process of GNNs by manually or automatically extracting a sub-structure from the original graph Cheng et al. ( 2017 2022) prune GNNs for speeding up inference. Recently, GLT Chen et al. (2021) has presented the first attempt to jointly sparsify the input graph and the GNN model, which significantly trims down the computational cost without compromising predictive accuracy. More importantly, GLT has opened up a novel research line for the graph learning community, on which our framework is built.

Lottery Ticket Hypothesis.
LTH articulates that a sparse and admirable subnetwork can be identified from a dense network by iterative pruning Frankle & Carbin (2018) . LTH is initially observed in dense networks and is broadly found in many fields Evci et al. (2020) ; Frankle et al. (2020) ; Malach et al. (2020) ; Ding et al. (2021) ; Chen et al. (2020a; 2021) ; Sui et al. (2021) . Derivative theories Chen et al. (2020b) ; You et al. (2021) ; Ma et al. (2021) are proposed to optimize the procedure of network sparsification and pruning. In addition to them, Dual Lottery Ticket Hypothesis (DLTH) considers a more general case to uncover the relationship between a dense network and its sparse counterparts Bai et al. (2022) . It argues that when attaching GIR to a pre-selected part of a dense network, the complementary part can be transformed into an excellent winning ticket in an isolated training way. In this paper, we draw inspiration from DLTH and for the first time explore a dual problem of GLT, i.e., how to transform a random ticket into a graph lottery ticket in GNNs.
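The two-layer GNN of Kipf & Welling (2016) recalled in the Graph Neural Networks paragraph of this section can be illustrated with a minimal NumPy sketch. The toy graph, the random weights, and the choice of ReLU for $\sigma$ are assumptions made only for illustration; the function names are hypothetical and not taken from any released code.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax_rows(H):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    E = np.exp(H - H.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A, X, theta0, theta1):
    """Z = softmax(A_hat * sigma(A_hat X Theta0) * Theta1); sigma = ReLU (assumption)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ theta0, 0.0)
    return softmax_rows(A_hat @ H @ theta1)

def cross_entropy(Z, Y_onehot, labelled):
    """Cross-entropy summed over the labelled nodes only."""
    logZ = np.log(np.clip(Z[labelled], 1e-12, 1.0))
    return -np.sum(Y_onehot[labelled] * logZ)

# Toy path graph with 4 nodes, 3 input features, 2 classes (all hypothetical).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
theta0 = rng.normal(size=(3, 5))
theta1 = rng.normal(size=(5, 2))

Z = gcn_forward(A, X, theta0, theta1)
Y = np.eye(2)[[0, 1, 0, 1]]
loss = cross_entropy(Z, Y, labelled=[0, 1])
```

Each row of `Z` is a class distribution for one node, and only the labelled nodes contribute to `loss`, mirroring the sum over $V_l$.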

3. METHODOLOGY

Figure 2 presents an overview of our DGLT for transforming a random ticket into a graph lottery ticket in GNNs. The first step is selecting a target structure for the subnetwork and subgraph. After that, the network is pretrained with Gradually Increased Regularization (GIR) for information extrusion. Meanwhile, we perform Hierarchical Graph Sparsification (HGS) to produce coarsened subgraph representations at each GNN layer. When GIR reaches a threshold, we stop increasing the penalty factors and train the GNN until the loss converges. Finally, we prune the network/graph for a joint sparsification in one shot and fine-tune the pruned model to obtain graph lottery tickets for evaluation. In the following parts, we commence by introducing HGS in Sec. 3.1 and then elaborate on how GIR benefits sparse network training by extruding information from other weights into the target sparse structure (see Sec. 3.2). We further theoretically justify DGLT's power in transforming graph lottery tickets using the popular Graph Information Bottleneck (GIB) theory in Sec. 3.3. Frequently used notations are listed in Appendix B for clarity.

3.1. HIERARCHICAL GRAPH SPARSIFICATION

As seen in Fig. 2, HGS learns the embedding representation $Z^{(l)}$ and the assignment matrix $S^{(l)}$ at layer $l$ ($l = 1, 2, \ldots, L$) via two GNN layers that do not share parameters. To be specific, we adopt a GNN sparsification layer (denoted as $\mathrm{GNN}^{(l)}_{hgs}$) after each GNN embedding layer (denoted as $\mathrm{GNN}^{(l)}_{em}$) to project nodes to a set of feature clusters $X^{(l)}$ and a coarsened adjacency matrix $A^{(l)}$, which are then utilized as the coarsened input for the next GNN layer or for the final prediction. To facilitate subgraph generation, we impose a differentiable mask $m^{(l)}_A$ on $A^{(l)}$ in the hierarchical sparsification process:

$$Z^{(l)} = \mathrm{GNN}^{(l)}_{em}\left(A^{(l-1)}, X^{(l-1)}; \Theta^{(l)}_{em}\right), \quad (1)$$

$$S^{(l)} = \mathrm{softmax}\left(\mathrm{GNN}^{(l)}_{hgs}\left(A^{(l-1)} \odot m^{(l-1)}_A, X^{(l-1)}; \Theta^{(l)}_{hgs}\right)\right). \quad (2)$$

In Eq. 1, the embedding GNN at the $l$-th layer takes $X^{(l-1)}$ and $A^{(l-1)}$ to produce the node embedding representation $Z^{(l)}$, which is clustered in the next sparsification layer. In Eq. 2, $\odot$ is the element-wise product and the softmax function is applied row-wise. The assignment matrix $S^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ ($n_{l-1} > n_l$) projects the coarsened subgraph into $n_l$ clusters. Notably, in our implementation we keep $n_{l-1}$ only slightly larger than $n_l$, so that graph information is extruded steadily into smaller sub-structures during hierarchical clustering. Given the embedding $Z^{(l)}$ and the assignment matrix $S^{(l)}$, we obtain the adjacency matrix $A^{(l)}$ and the embedding $X^{(l)}$ of the next layer as:

$$A^{(l)} = {S^{(l)}}^{T} A^{(l-1)} S^{(l)}, \qquad X^{(l)} = {S^{(l)}}^{T} Z^{(l)}. \quad (3)$$

After $L$ rounds of iterative embedding and sparsification of the input graph, we obtain a resilient subgraph representation $G_{hgs} = \{A^{(L)}, X^{(L)}\}$ in the last layer $L$, where $A^{(L)}$ denotes the connectivity between each pair of subgraph clusters and $X^{(L)}$ represents the new node representation.
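To make the coarsening in Eqs. 1-3 concrete, the following NumPy sketch stacks two HGS layers on a toy graph. It is a minimal illustration under stated assumptions: each of the embedding and sparsification GNNs is replaced by a single linear message-passing step, the masks are initialized to all ones, and the sizes ($6 \to 4 \to 2$ nodes) and the name `hgs_layer` are hypothetical.

```python
import numpy as np

def softmax_rows(H):
    """Row-wise softmax, as required for the assignment matrix in Eq. 2."""
    E = np.exp(H - H.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def hgs_layer(A_prev, X_prev, mask_prev, theta_em, theta_hgs):
    """One hierarchical sparsification step.

    A single linear propagation stands in for each GNN (assumption).
    """
    Z = np.tanh(A_prev @ X_prev @ theta_em)                        # Eq. 1
    S = softmax_rows((A_prev * mask_prev) @ X_prev @ theta_hgs)    # Eq. 2
    A_next = S.T @ A_prev @ S                                      # Eq. 3
    X_next = S.T @ Z                                               # Eq. 3
    return A_next, X_next, S

rng = np.random.default_rng(0)
n0, n1, n2, F, d = 6, 4, 2, 3, 5    # hypothetical sizes

# Random symmetric binary adjacency matrix without self-loops.
upper = np.triu(rng.integers(0, 2, size=(n0, n0)), 1).astype(float)
A0 = upper + upper.T
X0 = rng.normal(size=(n0, F))

A1, X1, S1 = hgs_layer(A0, X0, np.ones((n0, n0)),
                       rng.normal(size=(F, d)), rng.normal(size=(F, n1)))
A2, X2, S2 = hgs_layer(A1, X1, np.ones((n1, n1)),
                       rng.normal(size=(d, d)), rng.normal(size=(d, n2)))
```

Because every row of $S^{(l)}$ sums to one, the total edge weight of $A^{(l)}$ equals that of $A^{(l-1)}$ in this sketch: coarsening redistributes connectivity across clusters rather than dropping edges.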

3.2. GRADUALLY INCREASED REGULARIZATION FOR INFORMATION EXTRUSION

$L_2$ regularization, commonly known as weight decay Loshchilov & Hutter (2018), is one of the most popular regularization terms. $L_2$ regularization draws the weights closer to the origin by adding a constraint term $\Omega(w) = \frac{1}{2} \alpha \|w\|_2^2$ to the loss function, where $\alpha$ represents the penalty coefficient. DLTH Bai et al. (2022) presents the first attempt to leverage $L_2$ regularization to extrude information from a pre-selected part to its complementary counterpart. It demonstrates that when the penalty coefficient is progressively increased by a mini-step value, the magnitude gap between weights widens and the unimportant weights are naturally pushed close to zero LeCun et al. (1989); Wang et al. (2020a) (see proof in Appendix A).

As depicted earlier, we get a sparse representation $G_{sub}$. Then, we directly take the element-wise product of $A^{(L)}$ and a trainable mask $m^{(L)}_A$, and send the combination of $m^{(L)}_A \odot A^{(L)}$ and $X^{(L)}$ to GNN and MLP-layer predictors for label forecasting (bottom right in Fig. 2). In the DGLT framework, we pick a sparse network structure from the whole set of network parameters $\Theta$ and trainable mask matrices, then attach GIR to the remaining part to extrude information toward the target structure. DGLT is trained by optimizing the following objective function:

$$L_{DGLT} := L\left(\{m_A \odot A_{all}, X\}, \Theta\right) + \xi \|m^*_A\|_2^2 + \rho \|\Theta^*\|_2^2, \quad (4)$$

$$\xi^{(p+1)} = \begin{cases} \xi^{(p)} + \xi_a, & \xi^{(p)} < \xi_{ceil} \\ \xi^{(p)}, & \xi^{(p)} = \xi_{ceil} \end{cases} \qquad \rho^{(p+1)} = \begin{cases} \rho^{(p)} + \rho_a, & \rho^{(p)} < \rho_{ceil} \\ \rho^{(p)}, & \rho^{(p)} = \rho_{ceil} \end{cases} \quad (5)$$

In Eq. 4, $m_A$ and $A_{all}$ denote the mask sets and adjacency matrices of all sparsification layers. In addition to the cross-entropy loss, the objective function contains two $L_2$ regularization terms, where $m^*_A$ and $\Theta^*$ are the pre-selected parameters in $m_A$ and $\Theta$ that will be discarded after pretraining. Under the progressive mini-step increase of the regularization penalty, $m^*_A$ and $\Theta^*$ are pushed close to zero, and the information is extruded into the target part.

In Eq. 5, $\xi^{(p)}$ and $\rho^{(p)}$ are the penalty coefficients at the $p$-th update, $\xi_a$ and $\rho_a$ are the mini-step increments of the penalty coefficients, and $\xi_{ceil}$ and $\rho_{ceil}$ are their ceiling values. We fix the range of the penalty coefficients and increase them linearly until they reach their ceilings. To facilitate reading, we show the full procedure in Algo. 1.
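The penalty schedule of Eq. 5 and the regularized objective of Eq. 4 can be sketched in a few lines of NumPy. The mini-step and ceiling values, the flat parameter vectors, and the boolean selectors marking $m^*_A$ and $\Theta^*$ are hypothetical. Clamping with `min` when a step would overshoot the ceiling is an added assumption, since Eq. 5 only covers steps that land exactly on it.

```python
import numpy as np

def gir_step(value, step, ceil):
    """Eq. 5: grow the penalty by `step` while below `ceil`, then hold it.

    Clamping to `ceil` on overshoot is an assumption, not stated in Eq. 5.
    """
    if value < ceil:
        return min(value + step, ceil)
    return value

def dglt_objective(task_loss, mask, theta, mask_discard, theta_discard, xi, rho):
    """Eq. 4: task loss plus L2 penalties on the pre-selected entries only."""
    return (task_loss
            + xi * np.sum(mask[mask_discard] ** 2)
            + rho * np.sum(theta[theta_discard] ** 2))

# Linear ramp of both penalties up to their ceilings (hypothetical values).
xi = rho = 0.0
xi_history = []
for p in range(8):
    xi = gir_step(xi, step=1e-4, ceil=5e-4)
    rho = gir_step(rho, step=2e-4, ceil=5e-4)
    xi_history.append(xi)

mask = np.array([1.0, 0.5, 2.0])           # flattened mask entries
theta = np.array([3.0, -1.0])              # flattened network weights
mask_discard = np.array([False, True, True])
theta_discard = np.array([True, False])
obj = dglt_objective(1.0, mask, theta, mask_discard, theta_discard, xi, rho)
```

Only the entries selected for later removal are penalized, so the target structure is left free to absorb what the penalized part can no longer represent.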

3.3. GIB VIEW OF GRADUALLY INCREASING REGULARIZATION

Graph Information Bottleneck (GIB): Information Bottleneck (IB), which originates from information theory, aims to find a compressed code of the input signal while retaining as much valid information as possible from the original encoding Tishby et al. (2000). In recent years, IB has been adapted to deep neural networks in a variety of applications with excellent results Peng et al. (2018); Luo et al. (2019); Wang et al. (2020b); Wu et al. (2020); Yu et al. (2020); Miao et al. (2022). Our work builds upon its adaptation to the graph field, known as Graph Information Bottleneck (GIB). As defined above, $Y = \{y_1, y_2, \ldots, y_{|V|}\}$ denotes the labels of all nodes and $G$ is the input graph. GIB-based methods try to find an optimal subgraph $G^*_s$ in the subgraph set $G_{sub}(G)$ by optimizing:

$$\max_{G_{sub} \in G_{sub}(G)} I(G_{sub}, Y) - \beta I(G_{sub}, G) \longrightarrow G^*_s \quad (6)$$

Here $G_{sub}$ denotes a subgraph of $G$, $I(\cdot)$ represents Shannon mutual information, and $\beta$ is a hyperparameter that balances the two terms. In Eq. 6, the first term maximizes the mutual information between the subgraph and the labels, and the second term encourages the subgraph to be as small as possible. GIB theory tries to identify unserviceable or noisy nodes in graph-structured training data, which can be described in terms of spurious correlation versus causation. Intuitively, a spurious correlation means that introducing certain nodes or graphs into the training set does not increase, or even reduces, the mutual information between the training set and the labels Glymour et al. (2016); Arjovsky et al. (2019); Krueger et al. (2021). Different from sampling models, we try to transform a full graph into $\hat{G}^*_s$ through a transformation function $T(\cdot)$ (i.e., gradually increasing regularization) and guarantee the removal of spurious correlations (i.e., Eq. 6 is satisfied). We next provide a theoretical analysis of how DGLT can obtain an admirable subgraph from the GIB perspective.
Lemma 1: When the penalty is increased at the same pace, the weights respond differently because of their different local curvature structures: weights with larger curvature are moved less. As such, the magnitude discrepancy among weights is magnified as the regularization grows. Ultimately, the weights naturally separate (unimportant weights tend to be very small and can be regarded as noise) Wang et al. (2020a).

Based on Lemma 1, we list our two observations about the distribution of spurious correlations in a graph: (1) some nodes are pure noise; (2) some nodes are composed of both useless correlations and useful associations. Under these two observations, we obtain the following result: suppose each $G$ admits a subgraph set $G_{sub}(G)$; then there exists $G^*_s \in G_{sub}(G)$ that removes spurious correlations, i.e., $I(G^*_s, Y) \geq I(G, Y)$. Unserviceable or noisy information is distributed in the graph node features. In our implementation, we can transform $G$ into $\hat{G}^*_s$ and make sure that $I(\hat{G}^*_s, Y) \geq I(G^*_s, Y) \geq I(G, Y)$.

Theoretical Analysis. Obs 1: There exist pure noise nodes (denote this node set as $\dot{G}_{sub}$) which make no contribution to $I(G, Y)$ at all. Considering that our target is to maximize $I(G_{sub}, Y)$:

$$I(G_{sub}, Y) = I(Y; G, G_{sub}) - I(Y; G \mid G_{sub}) = I(Y; G) - I(Y; G \mid G_{sub}),$$

where the first equality follows from the chain rule for mutual information and the second equality is because

Theoretical Analysis. Obs 2: The spurious information is distributed in some nodes, and these nodes are composed of valid information and spurious information. $G_{sub} \in G_{sub}(G)$. $I(Y; G)$ is Suppose

Borgwardt et al. (2005). The statistics of these datasets can be seen in Table 5.

Backbones & Parameter Settings. For all selected backbones and datasets, we compare our DGLT algorithm with GLT Chen et al. (2021) and the random pruning algorithm under the same network settings. For regular-scale datasets, we adopt GCN Kipf & Welling (2016), GIN Xu et al. (2018) and GAT Veličković et al. (2017) as backbones. For Ogbl-Collab, which is a large-scale dataset, we take 28-layer deep ResGCNs Li et al. (2020a) as our backbone for link prediction. To evaluate our DGLT on the graph classification task, GraphSAGE Hamilton et al. (2017) is leveraged as the backbone on D&D and ENZYMES. More details about experimental settings can be found in Appendix E.

4.2. CAN DGLT FIND GRAPH LOTTERY TICKETS? (RQ1)

To answer RQ1, we compare our DGLT with GLT and random pruning on node classification. When investigating how accuracy changes with the growth of graph sparsity, we fix the weight sparsity to zero for stability, and vice versa. The results on Citeseer and PubMed are depicted in Fig. 4, and those on Cora are shown in Fig. 8. From Fig. 4 and Fig. 8, we have the following observations:

(1) DGLT consistently outperforms GLT and random pruning under the same graph/weight sparsity over all datasets, verifying the superiority of transforming a random ticket into a graph lottery ticket via information extrusion and hierarchical graph sparsification. For example, the graph lottery ticket identified by DGLT on PubMed+GIN has 85% graph sparsity or 87% weight sparsity. For Citeseer+GIN, we can obtain it with 65.0% graph sparsity and 91.0% weight sparsity using DGLT. As for random pruning, the effectiveness of the model decreases significantly as the sparsity rate increases, which further demonstrates the superiority of our DGLT algorithm.

(2) DGLT for the first time enables us to search for an ultra-lightweight subnetwork in GNNs. For node classification on Cora and Citeseer, the model surprisingly shows no significant performance drop (even surpassing the baseline performance) until 90% weight sparsity. For a larger dataset (i.e., PubMed), DGLT achieves 85.0% graph sparsity without performance degradation. This shows that DGLT, which transforms graph lottery tickets by GIR, can remedy the information loss and thus produce a more informative pair of subgraph and subnetwork.

(3) Whether we can find an extremely sparse subgraph depends on the properties of the input graph. Though DGLT can help find extremely sparse subgraphs in GNNs (e.g., on PubMed+GIN or PubMed+GAT), this phenomenon does not occur on smaller datasets (e.g., Cora). We argue that graph properties such as graph size may be a crucial factor in the possibility of obtaining a very sparse subgraph.
This question will be discussed further in Sec. 4.3. Besides node classification, we present additional empirical results on link prediction in Appendix F.

4.3. HOW DOES DGLT PERFORM ON LARGE-SCALE DATASETS? (RQ2)

Figure 5: Link prediction results on Ogbl-Collab+ResGCNs (axes: graph sparsity (%), weight sparsity (%), Hits@50 (%)).

To answer RQ2, we conduct experiments on the Ogbl-Collab dataset using ResGCNs as the backbone. We show the potential of DGLT in practice by evaluating the trade-off among inference time overhead, accuracy, and memory savings. As shown in Fig. 5, DGLT performs better than GLT and random pruning across different graph sparsity and weight sparsity levels. By using DGLT, we can obtain a graph lottery ticket with nearly 70% graph sparsity and 85% weight sparsity, which outperforms GLT by 6%∼7%. Meanwhile, we find that DGLT is more stable on large-scale datasets. Similar to DGLT's performance on PubMed, as graph sparsity increases (up to ∼90% sparsity) we witness only a slow decline in performance, whereas there is a very obvious fluctuation on small datasets (e.g., Cora). This shows that DGLT is more stable on, and well suited to, large-scale datasets.

4.4. ABLATION STUDIES (RQ3 & RQ4)

To complement the experiments in Sec. 4.2, we investigate a more complex case described in RQ3 by fixing the graph sparsity $P_g$ or the weight sparsity $P_\theta$ at a given ratio (10%, 30%, 50%, and 70%) and examining the model performance under different sparsity levels of the other term. The experimental results are reported in Fig. 6. Each line has a similar trend in both sub-figures, where the accuracy drops slightly as the graph (or weight) sparsity grows. For example, given $P_\theta = 70\%$, we can obtain an admirable subgraph (with only 2% lower accuracy) under 40% graph sparsity. The sparsified GNNs preserve excellent performance (only an 8%∼14% decrease in accuracy) even when the sparsity reaches 90%. This indicates the capability of DGLT in transforming graph lottery tickets with performance comparable to the original backbone. Moreover, we find that the lines in the right figure are more easily distinguished. When $P_g = 30\%$, an expressive subnetwork is achieved with 70% weight sparsity of the original network.

To evaluate hierarchical graph sparsification (HGS), we compare DGLT with a variant without HGS, which attaches a trainable mask to the adjacency matrix at the last GNN layer for graph sparsification, as GLT does. As depicted in Tab. 2, DGLT surpasses its variant without HGS by a large margin under the same graph sparsity. Such merits stem from its hierarchical information extrusion strategy, which allows the input graph to transfer information to the final small-size graph in a gradually compressed form and thus avoids the instability of one-shot rough extrusion.

Figure axes: graph sparsity (%), weight sparsity (%), accuracy (%).

Table 2: Effects of HGS over different datasets and backbones.

4.5. MINI-STEP VALUES OF REGULARIZATION (RQ5)

We fix the graph/weight sparsity to 30%/70% and choose multiple mini-step values of regularization ($\xi_a$ and $\rho_a$ take the same mini-step value) for comparison. From Tab. 3, we argue that the mini-step values of regularization should be kept in a small regime. When the progressive regularization step is in a larger regime (>5e-4), the model performance decreases significantly. This suggests that the process of information extrusion should be slow, and that excessively large regularization increments may drive the model into an ill-conditioned state.

5. CONCLUSION

In this paper, we propose a novel framework entitled Dual Graph Lottery Ticket (DGLT) that couples hierarchical graph sparsification and gradually increasing regularization to achieve triple-win graph lottery tickets (with high sparsity, admirable performance, and good explainability). Our work first points out that an admirable subgraph can be obtained by efficient hierarchical compression, which helps outperform off-the-shelf sampling-based GNN methods. We further generalize the key idea of the dual lottery ticket hypothesis to GNNs across various GNN backbones, learning tasks, and benchmarks. These explorations provide a new perspective for uncovering the relationships between the full model/graph and its sparse counterpart.

$$\nabla_w \tilde{J}(w) = H (w - w^*) = 0 \quad (14)$$

Gradually increasing regularization (GIR): We add the weight decay gradient to Eq. 14. When the regularized $\tilde{J}$ reaches its minimum (corresponding to $\tilde{w}$), we obtain:

$$\alpha \tilde{w} + H(\tilde{w} - w^*) = 0 \implies \tilde{w} = (H + \alpha I)^{-1} H w^* \quad (15)$$

After increasing the penalty $\alpha$ by $\delta\alpha$, the new converged weights $\hat{w}$ satisfy the same relation as Eq. 15 with respect to the previous convergence point $\tilde{w}$:

$$\hat{w} = (H + \delta\alpha I)^{-1} H \tilde{w} \quad (16)$$

where $I$ represents the identity matrix. For ease of exposition and to make the paper more self-contained, we follow the analysis in Wang et al. (2020a) and list two simplified cases.

(1) $H$ is diagonal. Consider $\hat{w}_i$ with second derivative $h_{ii}$. After adding a regularization force $\delta\alpha > 0$, the new converged weights can be shown to be

$$\hat{w}_i = \frac{h_{ii}}{h_{ii} + \delta\alpha} \tilde{w}_i \implies \frac{\hat{w}_i}{\tilde{w}_i} = \frac{1}{\delta\alpha / h_{ii} + 1} \quad (17)$$

where $\frac{\hat{w}_i}{\tilde{w}_i} \in [0, 1)$ since $h_{ii} \geq 0$ and $\delta\alpha > 0$. As seen, a larger $h_{ii}$ yields a larger $\frac{\hat{w}_i}{\tilde{w}_i}$, i.e., the weight moves relatively less toward the origin.

(2) We consider a general 2-d case, writing $\hat{H} = H + \delta\alpha I$ and $|\hat{H}|$ for its determinant:

$$\begin{bmatrix} \hat{w}_1 \\ \hat{w}_2 \end{bmatrix} = \frac{1}{|\hat{H}|} \begin{bmatrix} (h_{11} h_{22} + h_{11} \delta\alpha - h_{12}^2)\, \tilde{w}_1 + \delta\alpha\, h_{12}\, \tilde{w}_2 \\ (h_{11} h_{22} + h_{22} \delta\alpha - h_{12}^2)\, \tilde{w}_2 + \delta\alpha\, h_{12}\, \tilde{w}_1 \end{bmatrix} \approx \frac{1}{|\hat{H}|} \begin{bmatrix} (h_{11} h_{22} + h_{11} \delta\alpha - h_{12}^2)\, \tilde{w}_1 \\ (h_{11} h_{22} + h_{22} \delta\alpha - h_{12}^2)\, \tilde{w}_2 \end{bmatrix} \quad (18)$$

$$\implies \frac{\hat{w}_1}{\tilde{w}_1} = \frac{1}{|\hat{H}|} \left(h_{11} h_{22} + h_{11} \delta\alpha - h_{12}^2\right), \qquad \frac{\hat{w}_2}{\tilde{w}_2} = \frac{1}{|\hat{H}|} \left(h_{11} h_{22} + h_{22} \delta\alpha - h_{12}^2\right) \quad (19)$$

As seen, we can draw the same conclusion: $h_{11} > h_{22}$ also results in $\frac{\hat{w}_1}{\tilde{w}_1} > \frac{\hat{w}_2}{\tilde{w}_2}$. According to Wang et al. (2020a), when the penalty is increased at the same pace, the weights respond differently due to their different local curvature structures: weights with larger curvature are moved less. As the regularization gradually increases, the magnitude discrepancy among weights is magnified, and the unimportant weights tend to zero and can be regarded as noise.
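The two cases above can be checked numerically. The sketch below evaluates the appendix's relation $\hat{w} = (H + \delta\alpha I)^{-1} H \tilde{w}$ directly and compares it with the closed forms for the diagonal case and for the unapproximated 2-d expression. The curvature values, $\delta\alpha$, and the helper name `gir_round` are hypothetical choices for illustration.

```python
import numpy as np

def gir_round(H, w_prev, delta_alpha):
    """Appendix relation: w_hat = (H + delta_alpha * I)^{-1} H w_prev."""
    n = H.shape[0]
    return np.linalg.solve(H + delta_alpha * np.eye(n), H @ w_prev)

delta_alpha = 0.5
w_tilde = np.array([1.0, 1.0])

# Case (1): diagonal H; the ratio should equal 1 / (delta_alpha / h_ii + 1).
H_diag = np.diag([4.0, 0.5])
ratio_diag = gir_round(H_diag, w_tilde, delta_alpha) / w_tilde
closed_form = 1.0 / (delta_alpha / np.diag(H_diag) + 1.0)

# Case (2): general 2-d H with off-diagonal curvature h12.
h11, h22, h12 = 4.0, 0.5, 0.3
H2 = np.array([[h11, h12], [h12, h22]])
w_hat2 = gir_round(H2, w_tilde, delta_alpha)

# Exact (unapproximated) 2-d expression, divided by det(H + delta_alpha * I).
det_H_hat = np.linalg.det(H2 + delta_alpha * np.eye(2))
w_exact_2d = np.array([
    (h11 * h22 + h11 * delta_alpha - h12 ** 2) * w_tilde[0]
    + delta_alpha * h12 * w_tilde[1],
    (h11 * h22 + h22 * delta_alpha - h12 ** 2) * w_tilde[1]
    + delta_alpha * h12 * w_tilde[0],
]) / det_H_hat
ratio_2d = w_hat2 / w_tilde
```

In both cases the coordinate with the larger curvature keeps the larger fraction of its magnitude, which is the separation effect the appendix relies on.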

B NOTATIONS

For the convenience of reading and consulting, we put all the notations of this work in Table 4 .

C ALGORITHM OF DUAL GRAPH LOTTERY TICKET FRAMEWORK

In this part, we summarize our DGLT algorithm in Algo. 1. We first select the part of the parameters to be discarded later. Then we initialize the framework parameters and perform backpropagation to update the network parameters. After that, we adopt GIR for information extrusion. Once the useful information is sufficiently extruded into the target structure, we prune the pre-defined structure in one shot and fine-tune the unpruned parameters for model evaluation.

Table 4 (notations):

Notation | Description
$Z^{(l)}$ | Embedding representation after the $l$-th GNN embedding layer
$S^{(l)}$ | Assignment matrix for the $l$-th GNN embedding layer output
$A^{(l)}$ | Adjacency matrix after the $l$-th GNN sparsification layer
$X^{(l)}$ | Node representation after the $l$-th GNN sparsification layer
$\Theta$ |
$A_{all}$ | $A_{all} = \{A, A^{(1)}, \ldots, A^{(L)}\}$ represents all adjacency matrix outputs
$m_A$ | $m_A = \{m^{(0)}_A, m$
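The select, extrude, prune, and fine-tune procedure summarized at the start of this appendix can be illustrated on a toy problem. The sketch below is not the DGLT implementation: it swaps the GNN for a linear regression whose features are duplicated, so that a discarded weight can be fully compensated by a kept one. The data, learning rate, step sizes, and iteration counts are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression with duplicated features (hypothetical setup, not from the paper):
# columns 2 and 3 repeat columns 0 and 1.
X_base = rng.normal(size=(200, 2))
X = np.hstack([X_base, X_base])
w_true = np.array([1.0, -2.0, 0.5, 1.5])
y = X @ w_true

# Pre-select the weights to discard later (fixed here for reproducibility;
# a random choice would play the role of the random ticket).
discard = np.array([False, False, True, True])
w = rng.normal(size=4)                      # random initialization
eta, rho, rho_step, rho_ceil = 0.1, 0.0, 0.01, 1.0

def grad_mse(w):
    """Gradient of 0.5 * mean squared error with respect to w."""
    return X.T @ (X @ w - y) / len(y)

# GIR phase: penalize only the pre-selected weights, raising rho to its ceiling.
for _ in range(3000):
    g = grad_mse(w)
    g[discard] += 2.0 * rho * w[discard]    # gradient of rho * ||w[discard]||^2
    w -= eta * g
    rho = min(rho + rho_step, rho_ceil)

w_before_prune = w.copy()
w[discard] = 0.0                            # one-shot pruning

# Fine-tune the kept weights only; pruned weights stay at zero.
for _ in range(500):
    g = grad_mse(w)
    g[discard] = 0.0
    w -= eta * g

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

In this toy setting the penalized weights shrink toward zero while the kept weights absorb their contribution (here converging to the column-wise sums $1.0 + 0.5$ and $-2.0 + 1.5$), so one-shot pruning leaves the fit essentially unchanged.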

E EXPERIMENTAL SETTINGS

Metrics: Accuracy is the proportion of correct predictions among all predictions. ROC-AUC (Receiver Operating Characteristic-Area Under the Curve) is equivalent to the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. Hit@50 is the proportion of the top-50 candidate edges that are predicted correctly.

Sparsity ratio: we transform graphs into small-size representations and denote $1 - \frac{\|A^{(L)}\|_0}{|A|}$ as the graph sparsity ratio, where $\|\cdot\|_0$ and $|\cdot|$ are the number of non-zero elements and the total number of elements, respectively. Similarly, weight sparsity denotes the ratio of discarded parameters to the total number of parameters in the whole network.

Algorithm 1 Dual Graph Lottery Tickets (DGLT) Algorithm (aligned with Fig. 2)
Require: Input graph $G = \{A, X\}$, GNN $f(G; m_A, \Theta)$ with initialization parameters $\Theta_0$ and $m_{A_0}$, step size $\eta$, penalty hyper-parameters $\xi$ and $\rho$ of the regularization terms.
1: Select a part of the parameters to be discarded later, i.e., $m_A^*$ in $m_A$ and $\Theta^*$ in $\Theta$.
2: for layer $\phi = 1, 2, \dots, L$ in embedding GNN and sparsification GNN do
3:     Learn $\phi$-layer embedding representation $Z^{(\phi)}$ in Eq. 1.
4:     Learn $\phi$-layer coarsened representation by $S^{(\phi)}$ in Eq. 2.
5:     Hierarchical cluster to form new graph in Eq. 3.
6: Obtain $G_{sub} = \{A^{(L)}, X^{(L)}\}$.
7: Compute $m_A^{(L)} \odot A^{(L)}$ and send $G'_{sub} = \{m_A^{(L)} \odot A^{(L)}, X^{(L)}\}$ into the predictor.
8: for iteration $i = 1, 2, \dots, E$ do
9:     Forward $f(\{m_A \odot A, X\}, \Theta)$ to compute loss $\mathcal{L}_{DGLT}$ in Eq. 4.
10:    Update $\Theta_{i+1} \leftarrow \Theta_i - \eta \nabla_{\Theta_i} \mathcal{L}_{DGLT}$
11:    Update $m_{A_{i+1}} \leftarrow m_{A_i} - \eta \nabla_{m_{A_i}} \mathcal{L}_{DGLT}$
12:    Increase regularization penalty $\Theta_0$ and $m_{A_0}$ by mini-step in Eq. 5
13: One-shot pruning of $m_A^*$ and $\Theta^*$.
14: Fine-tune the model with rest parameters.
15: return Dual graph lottery ticket.

Train-val-test Splitting of Datasets. To rigorously verify the effectiveness of our proposed DGLT algorithm, we keep the network designs consistent within each task. For node classification on regular-size datasets, we follow the same data split criteria across backbones, i.e., 700 (Cora), 420 (Citeseer) and 460 (PubMed) labeled nodes for training, 500 nodes for validation and 500 nodes for testing. For link prediction, we shuffle the datasets and sample 85% of the edges for training, 10% for validation and 5% for testing. For Ogbl-Collab, in order to simulate a real collaborative recommendation application, we take the collaborations before 2017 as training edges, those in 2018 as validation edges and those in 2019 as testing edges. For graph classification, we choose the D&D and ENZYMES datasets. Both contain graphs of protein structures, in which a node represents an amino acid. We perform 10-fold cross-validation and report the accuracy averaged over the 10 folds.

Backbone settings. For the regular-scale datasets Cora, Citeseer and PubMed, we adopt GCN/GIN/GAT backbones for the node classification and link prediction tasks. In our implementation, we adopt a 3-layer GCN, a 3-layer GIN and a 3-layer GAT, respectively. For link prediction on Ogbl-Collab, we adopt 28-layer ResGCNs. For graph classification, we aim to obtain a small-size graph representation and use GraphSAGE for prediction. Concretely, we select three sparsification layers for graph size scaling. After the last sparsification layer, we add a GraphSAGE layer and an MLP for prediction. The training details and hyper-parameter configuration are given in Table 6.
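As a concrete reading of the two sparsity definitions in this section, the sketch below computes the graph sparsity ratio from the final coarsened adjacency matrix and the original one, and the weight sparsity from parameter counts. The function names and the dense list-of-lists representation are assumptions made for illustration.

```python
def graph_sparsity(adj_original, adj_final):
    """1 - ||A^(L)||_0 / |A|: non-zeros of the final graph over all entries of the original."""
    total_entries = sum(len(row) for row in adj_original)
    nonzeros = sum(1 for row in adj_final for value in row if value != 0)
    return 1.0 - nonzeros / total_entries


def weight_sparsity(num_discarded, num_total):
    """Ratio of discarded parameters to all parameters in the network."""
    return num_discarded / num_total


# Illustrative example: a 4-node graph coarsened to 2 clusters.
adj_original = [[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]]
adj_final = [[1, 1],
             [1, 0]]

g_sparsity = graph_sparsity(adj_original, adj_final)
w_sparsity = weight_sparsity(num_discarded=80, num_total=100)
print(g_sparsity, w_sparsity)
```

Note that under this definition the denominator is the size of the original adjacency matrix, so a smaller coarsened graph yields a higher ratio even when the coarsened matrix itself is dense.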
Weight sparsity settings. In our implementation, we set the sparsity of each individual layer equal to the total sparsity and choose the weights at random within each layer. In this setting, we keep the first layer dense, since sparsifying this layer has a disproportionate effect on performance and almost no effect on the total size.

Hierarchical graph sparsification ratio. For graph classification, we found that $n_l$ has little effect on the results, while the final $n_L$ can be limited to a small size (even 95% graph sparsity). For node classification and link prediction, $n_l$ should be kept slightly larger than $n_{l+1}$ in the hierarchical graph sparsification (HGS) process.
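The weight sparsity setting described here (uniform per-layer sparsity, random selection, dense first layer) can be sketched as follows. This is a hedged illustration: the function name, the flat per-layer mask layout and the fixed seed are assumptions, not details taken from the paper.

```python
import random


def random_layer_masks(layer_sizes, sparsity, seed=0):
    """Build one 0/1 mask per layer; 0 marks a weight selected for removal.

    Every layer except the first receives the same sparsity ratio,
    and the first layer is kept fully dense.
    """
    rng = random.Random(seed)
    masks = []
    for index, size in enumerate(layer_sizes):
        if index == 0:
            masks.append([1] * size)  # first layer stays dense
            continue
        num_dropped = round(sparsity * size)
        dropped = set(rng.sample(range(size), num_dropped))
        masks.append([0 if j in dropped else 1 for j in range(size)])
    return masks


masks = random_layer_masks(layer_sizes=[10, 20, 20], sparsity=0.8)
kept_per_layer = [sum(mask) for mask in masks]
print(kept_per_layer)
```

Because the first layer is exempt, the overall fraction of removed weights is slightly below the per-layer ratio; the gap is small when the first layer holds few parameters.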

F ADDITIONAL EXPERIMENTS TO ANSWER RQ1

More experiments on the link prediction task are shown in Fig. 7. We make the following observations: (1) DGLT substantially improves inference efficiency without significant performance degradation. On Cora, we obtain graph lottery tickets with nearly 50% graph sparsity and 80% weight sparsity. On Citeseer, we obtain graph lottery tickets with nearly 70% graph sparsity and 85% weight sparsity. On PubMed, we obtain graph lottery tickets with nearly 60%

2022). Our explanation focuses on why the transformed subgraph can make reliable predictions, which is similar to the approaches described above. In our implementation, we give an explanation, from a graph information bottleneck perspective, of why the transformed subgraph retains expressive ability. This "explanation" process (Section 3.3) is similar to Miao et al. (2022); the difference is that our subgraph is not a fraction of the original full graph but is obtained by the hierarchical graph sparsification (HGS) algorithm. To the best of our knowledge, this is the first work to transform a graph lottery ticket from the information bottleneck perspective, and we will explore the relationship between GNN explainability and gradually increased regularization in future work.

I DISCUSSION OF DGLT

In this work, we follow the perspective of the Dual Lottery Ticket Hypothesis (DLTH) and investigate GNN subnetwork training and subgraph identification from a complementary direction: given a specific substructure of a GNN model or a specific size of adjacency matrix, we can always transform it into a winning graph lottery ticket (note that we consider the common case of uniformly random selection of the GNN substructure or subgraph compression form, excluding extreme situations such as disconnected subnetworks or subgraphs). This conjecture, if true, has promising practical implications: it may suggest that the message passing function (i.e., information aggregation) of training a GNN model is in fact unnecessary, as one only needs to select a target size of adjacency matrix or a target substructure of the GNN, and then use the hierarchical graph sparsification (HGS) algorithm or gradually increased regularization for information extrusion.

Comparisons with DLTH. DLTH focuses on transforming a randomly initialized dense network into an admirable subnetwork whose performance is at least comparable to, and possibly better than, that of LTH. Building upon DLTH, our DGLT is the first to generalize this idea to GNNs, and it investigates a more universal yet challenging problem: how can a randomly selected ticket (i.e., a pair of graph and network) be transformed into a graph lottery ticket in GNNs? However, the key tool of DLTH, gradually increased regularization, is not applicable to graphs due to their fixed structure. To this end, we adopt HGS to bridge this gap and adjustably pre-define the substructure of adjacency matrices. Compared with the



Different from pruning, we transform graphs into small-size representations and denote $1 - \frac{\|A^{(L)}\|_0}{|A|}$ as the sparsity ratio, where $\|\cdot\|_0$ and $|\cdot|$ are the number of non-zero elements and the total number of elements, respectively. In our experiments, we specify a graph lottery ticket as a pair of subgraph and subnetwork with comparable accuracy to the baseline of full GNNs on full graphs.

This work is partially supported by the National Natural Science Foundation of China (No. 62072427, No. 12227901), the Project of Stable Support for Youth Team in Basic Research Field, CAS (No. YSBR-005), and the Academic Leaders Cultivation Program, USTC.





Figure 1: Left: graph before sparsification. Right: a sparse graph obtained by GLT Chen et al. (2021). The red star (★) denotes a node that connects two important communities. After adopting GLT, it can be seen that the edge connecting the two communities is discarded. Consequently, ★ can no longer aggregate information from both sub-structures.

); Chen et al. (2018); Rong et al. (2019); Li et al. (2020b); Faber et al. (2021). Graph compression algorithms attempt to merge an original graph to form a new small graph for fast representation learning Chakeri et al. (2016); Chiang et al. (2019). Tailor et al. (2020); Zhou et al. (2021); Chen et al. (

Figure 2: (Left) The overview of DGLT. (Right) The details of DGLT algorithm.

HIERARCHICAL GRAPH SPARSIFICATION

Inspired by recent advances in clustering-based graph pooling methods Wu et al. (2022a); Ying et al. (2018); Roy et al. (2021), we propose Hierarchical Graph Sparsification (HGS) to generate hierarchical graph representations across different GNN layers for graph sparsification, where the number of nodes is reduced as the GNNs go deeper. Towards this goal, HGS learns a differentiable soft assignment matrix for nodes at each GNN layer, mapping input nodes to multiple clusters which are then fed to the next GNN layer as coarsened inputs.
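One coarsening step with a soft assignment matrix can be sketched as below. The sketch assumes a DiffPool-style update in the spirit of Ying et al. (2018), namely $X' = S^\top Z$ and $A' = S^\top A S$ with a row-wise softmax for $S$; the exact form of the paper's own equations may differ. The one-step propagation `A X W` standing in for a GNN layer, the weight values and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, so each node's cluster memberships sum to 1."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def coarsen(adj, feats, w_embed, w_assign):
    """One assumed DiffPool-style step: n nodes -> k soft clusters."""
    propagated = matmul(adj, feats)                 # aggregate neighbour features
    z = matmul(propagated, w_embed)                 # node embeddings Z, shape n x d
    s = softmax_rows(matmul(propagated, w_assign))  # soft assignment S, shape n x k
    x_new = matmul(transpose(s), z)                 # coarsened features, k x d
    a_new = matmul(matmul(transpose(s), adj), s)    # coarsened adjacency, k x k
    return s, x_new, a_new


adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
w_embed = [[0.3, -0.1], [0.2, 0.4]]    # 2 input features -> 2 hidden dimensions
w_assign = [[1.0, -1.0], [-1.0, 1.0]]  # 2 input features -> 2 clusters
s, x_new, a_new = coarsen(adj, feats, w_embed, w_assign)
print(len(a_new), len(a_new[0]))
```

Because every row of $S$ sums to one, the total edge weight is preserved by the coarsening; the coarsened adjacency is generally dense, so any sparsity at this level would have to come from a separate mask applied afterwards.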

Figure 3: White/blue nodes indicate that all of their features increase/decrease $I(G, Y)$, while green/yellow nodes contain both useful and spurious associations, with the useful/spurious associations dominating in the end.

given from the beginning, so $I(Y; G)$ is a constant. Our target is to maximize $I(G_{sub}, Y)$ (i.e., to minimize $I(Y; G \mid G_{sub})$). Since all nodes in $\dot{G}_{sub}$ are noise, $I(\dot{G}_{sub}, Y)$ in $I(G, Y) - I(G \setminus \dot{G}_{sub}, Y) = I(\dot{G}_{sub}, Y)$ reaches its minimum. In GIB, $G_s^* = G \setminus \dot{G}_{sub}$. With gradually increasing regularization, we can extrude information into a pre-defined structure. Since $\dot{G}_{sub}$ is unserviceable, the amount of information squeezed from it into the transformed $\hat{G}_s^*$ is 0. Based on the above inference, we obtain $I(G_s^*, Y) = I(\hat{G}_s^*, Y)$ in Obs 1.

Figure 4: Results of node classification over Citeseer/PubMed with GCN/GIN/GAT backbones. Blue dash lines represent the baseline performance of full GNNs on full graphs.

Figure 6: Effects of different pruning ratios for transforming sparse graphs and networks on Cora+GCN setting.

Parameters of the $l$-th layer of the embedding GNN
$\Theta_{hgs}^{(l)}$ $(l = 1, 2, \dots, L)$: Parameters of the $l$-th layer of the sparsification GNN
$m_A^{(l)}$: Trainable matrix for masking $A^{(l)}$

for $A_{all}$

D COMPLEXITY ANALYSIS OF GLT AND DGLT

Following GLT Chen et al. (2021), we present the complexity of the DGLT algorithm. For GLT, the inference time complexity is $O\left(L \times \|m_g \odot A\|_0 \times F + L \times \|m_\theta\|_0 \times |V| \times F^2\right)$, where $L$ is the number of layers, $\|m_g \odot A\|_0$ is the number of remaining edges in the sparse graph, $F$ is the feature dimension and $|V|$ is the number of nodes. The inference time complexity of DGLT is $O\left(\|m_A \odot A_{all}\|_0 \times F + \|m^*\|_0 \times |V| \times F^2\right) + O(K)$, where $m_A = \{m_A^{(0)}, m_A^{(1)}, \dots, m_A^{(L)}\}$ is the set of masks for all adjacency matrix outputs $A_{all}$, and $\|m^*\|_0$ counts all remaining parameters of the two non-shared GNNs. $O(K)$ is the inference time complexity of learning the node embeddings and the assignment matrix. These are obtained by multiplying multiple matrices, which gives $O(K) = O\left(L \times |V|^3 + L \times |V| \times F\right)$.
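To make the two expressions easier to compare, the sketch below evaluates them literally as operation-count proxies with all constants dropped. It is an arithmetic illustration only: the function names and the example values are assumptions, and the counts are not measured runtimes.

```python
def glt_inference_cost(num_layers, remaining_edges, remaining_weights, num_nodes, feat_dim):
    """L * ||m_g . A||_0 * F + L * ||m_theta||_0 * |V| * F^2 (constants dropped)."""
    aggregation = num_layers * remaining_edges * feat_dim
    transformation = num_layers * remaining_weights * num_nodes * feat_dim ** 2
    return aggregation + transformation


def dglt_inference_cost(num_layers, remaining_edges_all, remaining_weights, num_nodes, feat_dim):
    """||m_A . A_all||_0 * F + ||m*||_0 * |V| * F^2 + K, with K = L*|V|^3 + L*|V|*F."""
    aggregation = remaining_edges_all * feat_dim
    transformation = remaining_weights * num_nodes * feat_dim ** 2
    assignment = num_layers * num_nodes ** 3 + num_layers * num_nodes * feat_dim
    return aggregation + transformation + assignment


# Illustrative values only.
glt_cost = glt_inference_cost(num_layers=2, remaining_edges=10,
                              remaining_weights=5, num_nodes=4, feat_dim=3)
dglt_cost = dglt_inference_cost(num_layers=2, remaining_edges_all=10,
                                remaining_weights=5, num_nodes=4, feat_dim=3)
print(glt_cost, dglt_cost)
```

The cubic term in $|V|$ belongs to the assignment step, so on large graphs it can dominate the DGLT count unless the coarsened layers are much smaller than the input graph.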

Figure 7: Results for the additional link prediction task. We adopt Cora/Citeseer/PubMed as benchmarks and test our DGLT algorithm on three backbones (GCN/GIN/GAT) for comparison. Blue dashed lines represent the baseline performance of full GNNs on full graphs.

Figure 9: Ablation studies of the pruning ratios for transforming the sparse graph ($p_g$) and network ($p_\theta$). We select Citeseer+GAT for the link prediction task and PubMed+GIN for the node classification task. Blue dashed lines represent model performance of full networks on full graphs.

• Extensive experiments are conducted on GNN benchmarks to examine our DGLT. The results show that DGLT consistently outperforms GLT across various graph/network sparsity levels on these benchmarks. For node classification, DGLT achieves 40% ∼ 85% graph sparsity and 65% ∼ 92% weight sparsity (with no performance degradation), an improvement of about 13% ∼ 30% in graph sparsity and 5% ∼ 10% in weight sparsity. For a large-scale dataset, i.e., Ogbl-Collab, our model can obtain graph lottery tickets with nearly 43% sparsity gain on the graph. These findings demonstrate its potential in a wide range of real-world applications.

contains at least more useful information than $\hat{G}_s^*$ under the same sparseness as $\hat{G}_s^*$. We can obtain $I(\hat{G}_s^*, Y) > I(G_s^*, Y) > I(G, Y)$ in Obs 2. To summarize, DGLT can obtain a subgraph that is comparable to (or even better than) that of sampling-based algorithms. The differences between DGLT and other methods are shown in Tab. 1.

Table 1: Comparison between our DGLT, LTH Frankle & Carbin (2018), DLTH Bai et al. (2022) and GLT Chen et al. (2021). Pruning, graph controllability, network controllability, transformation, and pretrain denote, respectively, the type of pruning process, whether the sparse graph structure is controllable, whether the sparse network structure is controllable, whether the selected subnetwork needs transformation before fine-tuning, and whether pretraining a dense network is needed.

Different mini-step values of regularization over Cora/Citeseer/PubMed with different backbones. For clarity, the highest/second-highest performances are emphasized with red/blue fonts.

instead of diagonal matrix of H.

The notations that are commonly used in Methodology (Sec. 3).

Dataset details. The description of the metrics is placed in Appendix E

Training details and hyper-parameter configuration. $\xi^{(0)}$ and $\rho^{(0)}$ indicate the starting values of the graph regularization and weight regularization, respectively. $\xi_a$ and $\rho_a$ indicate the increments of the graph regularization and weight regularization, respectively.

A PROOF OF GRADUALLY INCREASING REGULARIZATION

Regularization has long been regarded as a tool for limiting the capacity of deep learning networks by adding a penalty term $\Omega(\Theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:

$$\tilde{J}(\Theta; X, y) = J(\Theta; X, y) + \alpha\,\Omega(\Theta)$$

where $\alpha \in [0, \infty)$ is a hyperparameter that weights the relative contribution of the penalty term. Setting $\alpha$ to 0 results in no regularization. Larger values of $\alpha$ correspond to more regularization. For $L_2$ regularization (commonly known as weight decay), $\Omega(\Theta) = \frac{1}{2}\|w\|_2^2$ is added to the objective function. To simplify the presentation, we assume no bias parameter, so $\Theta$ is just $w$, and the objective function becomes:

$$\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2}\, w^\top w$$

where the gradient of the objective function is:

$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$$

The whole network can be optimized by a gradient step with step size $\eta$:

$$w \leftarrow w - \eta\left(\alpha w + \nabla_w J(w; X, y)\right)$$

Written another way, the update is:

$$w \leftarrow (1 - \eta\alpha)\, w - \eta \nabla_w J(w; X, y)$$

Further, we simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the weights that obtain minimal unregularized training cost, $w^* = \arg\min_w J(w)$. In the neighborhood of $w^*$, we obtain:

$$\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*)$$

where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is no first-order term in this quadratic approximation, because $w^*$ is defined to be a minimum, where the gradient vanishes. Meanwhile, because $w^*$ is the minimum of $J$, we can conclude that $H$ is positive semidefinite. The minimum of $\hat{J}$ occurs where its gradient:

Further, we apply the DGLT algorithm to a 3-layer GraphSAGE Hamilton et al. (2017) on the ENZYMES and D&D datasets under different weight sparsity ratios. Following the settings of the original graph pooling method DIFFPOOL, we fix the graph sparsity at 95% (two pooling layers in total in DIFFPOOL) and observe the performance under different weight sparsity levels. As shown in Table 7, the information extrusion algorithm maintains accuracy well under high sparsity. On the ENZYMES dataset, the expressiveness of the model can even improve when 80% of the weights are pruned.
Meanwhile, when the weight sparsity of the model reaches 90%, the performance still does not decrease significantly: it drops by only 2.24% on the ENZYMES dataset and by 0.62% on the D&D dataset. In a nutshell, these results demonstrate that our DGLT algorithm can also be easily generalized to graph classification tasks, which further illustrates that DGLT is readily pluggable.

