SEARCHING LOTTERY TICKETS IN GRAPH NEURAL NETWORKS: A DUAL PERSPECTIVE

Abstract

Graph Neural Networks (GNNs) have shown great promise in various graph learning tasks. However, the computational overhead of fitting GNNs to large-scale graphs grows rapidly, preventing GNNs from scaling up to real-world applications. To tackle this issue, the Graph Lottery Ticket (GLT) hypothesis articulates that there always exists a sparse subnetwork/subgraph with admirable performance in GNNs with random initialization. Such a pair of core subgraph and sparse subnetwork (called a graph lottery ticket) can be uncovered by iteratively applying a novel sparsification method. While GLT provides new insights into GNN compression, it requires a full pretraining process to obtain graph lottery tickets, which is neither universal nor friendly to real-world applications. Moreover, the graph sparsification in GLT relies on sampling techniques, which may result in massive information loss and aggregation failure. In this paper, we explore the search for graph lottery tickets from a complementary perspective: transforming a random ticket into a graph lottery ticket, which allows us to more comprehensively explore the relationships between the original network/graph and their sparse counterparts. Compared to GLT, our proposal helps achieve a triple win of graph lottery tickets with high sparsity, admirable performance, and good explainability. More importantly, we rigorously prove that our model can eliminate noise and maintain reliable information in substructures using the graph information bottleneck theory. Extensive experimental results on various graph-related tasks validate the effectiveness of our framework.

1. INTRODUCTION

Graph Neural Networks (GNNs) Kipf & Welling (2016); Hamilton et al. (2017) have recently emerged as the dominant model for a variety of graph learning tasks, such as node classification Velickovic et al. (2017), link prediction Zhang & Chen (2019), and graph classification Ying et al. (2018). The success of GNNs mainly derives from a recursive neighborhood aggregation scheme, i.e., message passing, in which each node updates its features by aggregating and transforming the features of its neighbors. However, GNNs suffer from notoriously high computational overheads when scaling up to large graphs or graphs with dense connections, since conducting message passing over large or dense graphs proves costly for both training and inference Xu et al. (2018); You et al. (2020). To alleviate such inefficiency, existing approaches mostly fall into two research lines: they either simplify the graph structure or compress the GNN model. Within the first class, many studies Chen et al. (2018); Eden et al. (2018); Calandriello et al. (2018) have investigated the use of sampling to reduce the computational footprint of GNNs. These sampling-based strategies are usually integrated with a mini-batch training schedule for local feature aggregation and updating. Another representative line is graph sparsification Voudigari et al. (2016); Zheng et al. (2020); Li et al. (2020b), which improves the training or inference efficiency of GNNs by learning to remove redundant edges from input graphs. In contrast to simplifying the graph structure, there are much fewer prior studies on pruning or compressing GNNs Tailor et al. (2020), as GNNs are generally less parameterized than DNNs in other fields, e.g., computer vision Wen et al. (2016); He et al. (2017).

Further, the Graph Lottery Ticket (GLT) hypothesis Chen et al. (2021) has surprisingly killed two birds with one stone: for the first time, it simultaneously simplifies the input graph and prunes the GNN without compromising model performance. The key insight is to generalize the theory of the Lottery Ticket Hypothesis (LTH) Frankle & Carbin (2018) to GNNs. Recalling that LTH articulates that there always exist sparse, high-performance subnetworks in a dense network with random initialization (like winning tickets in a lottery pool), GLT delineates a graph lottery ticket as a combination of a core subgraph and a sparse subnetwork with admirable performance. More specifically, GLT first devises a unified GNN sparsification (UGS) strategy for jointly pruning the graph adjacency matrix as well as the network weights, and then iteratively applies UGS to uncover the winning tickets in GNNs. Extensive experiments on GNN benchmarks have verified the effectiveness of GLT across various architectures, learning tasks, and initialization schemes.

After revisiting the theory of GLT, we expose two crucial factors that may impede GLT in practice. Firstly, GLT takes a whole pretraining process to obtain a sparse subnetwork, which limits its applicability to real-world usage and meanwhile complicates the investigation of the relationships between the original network (or graph) and its sparse counterpart. Secondly, the sampling-based graph simplification in GLT may lead to two devastating challenges: a) Information loss: pruning subgraph edges as GLT does may cause massive information loss, resulting in performance collapse Wu et al. (2022a). b) Aggregation failure: as the sparsity increases, some "unimportant" edges may be discarded by the pruning algorithm, even though they sometimes connect two very important local communities (see Fig. 1). In this paper, we investigate a more universal yet challenging problem from a perspective complementary to GLT: how can we transform a randomly selected ticket (i.e., a pair of graph and network) into a graph lottery ticket in GNNs? Compared to the magnitude-based network pruning and graph sparsification in GLT, such a transformation process enables us to more comprehensively explore the relationships between the original network/graph and their sparse counterparts. We make two-fold efforts to answer the above question: regularization-based network pruning and hierarchical graph sparsification. First, we present the first attempt to generalize the Dual Lottery Ticket Hypothesis (DLTH) Bai et al. (2022) to GNN network pruning. Initially designed for pruning deep neural networks, DLTH utilizes a Gradually Increased Regularization (GIR) term Wang et al. (2020a) to transfer the model expressivity from the discarded part to the remaining part. When adapting it to GNNs, we first randomly select a target sparse subnetwork within the original dense network, and then attach GIR to the rest part to stimulate magnitude discrepancy among the parameters.

In other words, as the regularization penalty factor increases, information is continuously extruded from the rest part into our target subnetwork. Once the magnitude discrepancy among the parameters is large enough, we remove the rest part to obtain the final sparse network. However, GIR alone is not applicable to graph sparsification when generalizing DLTH to find graph lottery tickets. To this end, we propose Hierarchical Graph Sparsification (HGS), which is not only compatible with the GIR-based pruning strategy but also mitigates the information loss and aggregation failure issues in GLT. HGS learns a differentiable soft assignment matrix for the nodes at each GNN layer, projecting nodes onto a set of clusters, which is then utilized as the coarsened input for the next GNN layer. Hierarchical representations of graphs are produced accordingly. Finally, we take the elementwise product of the adjacency matrix of the coarsened graph at each GNN layer with a trainable mask for graph sparsification. In this way, useful information is extruded into the anticipated structure, thereby avoiding massive information loss. Since there are no node- or edge-dropping operations in our method, HGS naturally remedies the aggregation failure in GNNs as well. We elaborately unify the above regularization-based pruning and hierarchical graph sparsification into a single framework for transforming a randomly selected ticket into a graph lottery ticket in GNNs, leading to a dual perspective on GLT. We therefore name our framework Dual Graph
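To make the GIR-based transformation concrete, consider the following toy sketch. It is entirely our own illustration: the least-squares task, the penalty schedule, and the stable proximal-style shrinkage step are assumptions, not the paper's actual algorithm. A gradually increased L2 penalty on the randomly chosen "rest" weights squeezes their magnitudes toward zero, after which the rest part can be removed to leave the target sparse subnetwork:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
y = X @ rng.normal(size=n_features)          # toy regression targets

w = rng.normal(size=n_features)              # dense, randomly initialized weights
target = np.zeros(n_features, dtype=bool)    # randomly select a target subnetwork
target[rng.permutation(n_features)[:6]] = True
rest = ~target

lr, lam, lam_growth = 1e-2, 1e-3, 1.05
for _ in range(500):
    grad = X.T @ (X @ w - y) / n_samples     # least-squares gradient
    w -= lr * grad
    w[rest] /= 1.0 + lr * lam                # shrink the rest part only (GIR penalty)
    lam *= lam_growth                        # gradually increase the regularization

rest_mag = float(np.abs(w[rest]).max())      # rest part has been squeezed toward 0
w[rest] = 0.0                                # remove it: the final sparse network
```

As `lam` grows, expressivity is transferred from the rest part into the target subnetwork; once the magnitude discrepancy is large enough, discarding the rest part changes the model only negligibly on this toy problem.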
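One coarsening level of HGS can likewise be sketched. The snippet below is a minimal illustration under our own assumptions (a random toy graph, an untrained assignment matrix, and a symmetrized sigmoid mask; none of this is claimed to be the paper's implementation): a soft assignment S projects n nodes onto k clusters, yielding a coarsened adjacency SᵀAS and coarsened features SᵀX, and a trainable mask is applied elementwise to the coarsened adjacency for sparsification, so no node or edge is ever hard-dropped:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, k, d = 8, 3, 5                                # nodes, clusters, feature dim
A = rng.random((n, n)) < 0.4                     # random toy graph
A = np.triu(A, 1)
A = (A | A.T).astype(float)                      # symmetric adjacency, no self-loops
X = rng.normal(size=(n, d))                      # node features

logits = rng.normal(size=(n, k))                 # per-node cluster logits (trainable)
S = softmax(logits, axis=1)                      # soft assignment: each row sums to 1

A_coarse = S.T @ A @ S                           # coarsened adjacency (k x k)
X_coarse = S.T @ X                               # coarsened features (k x d)

mask_param = rng.normal(size=(k, k))             # trainable sparsification mask
mask = sigmoid((mask_param + mask_param.T) / 2)  # symmetrized, values in (0, 1)
A_sparse = A_coarse * mask                       # elementwise product, nothing dropped
```

Because the mask multiplies edge weights rather than deleting edges, every cluster can still aggregate from all of its neighbors, which is how this construction sidesteps the aggregation failure illustrated in Fig. 1.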



Figure 1: Left: graph before sparsification. Right: a sparse graph obtained by GLT Chen et al. (2021). The red star (★) denotes a node that connects two important communities. After adopting GLT, it can be seen that the edge connecting the two communities is discarded. Consequently, ★ can no longer aggregate information from both sub-structures.

