RETHINKING GRAPH LOTTERY TICKETS: GRAPH SPARSITY MATTERS

Abstract

The Lottery Ticket Hypothesis (LTH) claims the existence of a winning ticket (i.e., a properly pruned sub-network together with the original weight initialization) that can achieve performance competitive with the original dense network. A recent work, called UGS, extended LTH to prune graph neural networks (GNNs) for effectively accelerating GNN inference. UGS simultaneously prunes the graph adjacency matrix and the model weights using the same masking mechanism; but since the roles of the graph adjacency matrix and the weight matrices are very different, we find that their sparsifications lead to different performance characteristics. Specifically, we find that the performance of a sparsified GNN degrades significantly once the graph sparsity goes beyond a certain extent. Therefore, we propose two techniques to improve GNN performance when the graph sparsity is high. First, UGS prunes the adjacency matrix with a loss formulation that does not properly involve all elements of the adjacency matrix; in contrast, we add a new auxiliary loss head to better guide the edge pruning by involving the entire adjacency matrix. Second, by regarding unfavorable graph sparsification as adversarial data perturbation, we formulate the pruning process as a min-max optimization problem to make lottery tickets robust at high graph sparsity. We further investigate the question: Can the "retrainable" winning ticket of a GNN also be effective for graph transfer learning? We call this the transferable graph lottery ticket (GLT) hypothesis. Extensive experiments demonstrate the superiority of our proposed sparsification method over UGS and empirically verify our transferable GLT hypothesis.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017) have demonstrated state-of-the-art performance on various graph-based learning tasks. However, large graph size and over-parameterized network layers limit the scalability of GNNs, causing high training cost, slow inference, and large memory consumption. Recently, the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2019) claimed that there exist properly pruned sub-networks, together with the original weight initialization, that can be retrained to achieve performance comparable to the original large deep neural networks. LTH has recently been extended to GNNs by Chen et al. (2021b), who propose a unified GNN sparsification (UGS) framework that simultaneously prunes the graph adjacency matrix and the model weights to accelerate GNN inference on large graphs. Specifically, two differentiable masks m_g and m_θ are applied to the adjacency matrix A and the model weights Θ, respectively, via element-wise product during end-to-end training. After training, the lowest-magnitude elements in m_g and m_θ are set to zero w.r.t. pre-defined ratios p_g and p_θ, which eliminates low-scored edges and weights, respectively. The weight parameters are then rewound to their original initialization, and this pruning process is repeated until pre-defined sparsity levels are reached, i.e., graph sparsity 1 − ∥m_g∥_0 / ∥A∥_0 ≥ s_g and weight sparsity 1 − ∥m_θ∥_0 / ∥Θ∥_0 ≥ s_θ, where ∥·∥_0 is the L_0 norm counting the number of non-zero elements. Intuitively, UGS simply extends the basic parameter-masking algorithm of Frankle & Carbin (2019) for identifying winning tickets so that it also masks and removes graph edges. However, our empirical study finds that the performance of a sparsified GNN degrades significantly when the graph sparsity goes beyond a certain level, while it is relatively insensitive to weight sparsification.
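The iterative magnitude-pruning loop described above can be sketched as follows. This is a minimal illustration of the masking mechanism, not UGS's actual code: `prune_lowest` and `sparsity` are hypothetical helpers, the mask is a flat score vector with one entry per edge, and the steps that re-learn the mask scores and rewind the weights between rounds are omitted.

```python
import numpy as np

def prune_lowest(mask, ratio):
    """Zero out the fraction `ratio` of the currently surviving mask
    entries that have the lowest magnitudes (one pruning round)."""
    m = mask.copy()
    nz = np.flatnonzero(m)                        # indices of surviving entries
    k = int(len(nz) * ratio)                      # how many to remove this round
    if k > 0:
        drop = nz[np.argsort(np.abs(m[nz]))[:k]]  # lowest |score| first
        m[drop] = 0.0
    return m

def sparsity(mask):
    """Sparsity level 1 - ||m||_0 / |m| (fraction of pruned entries)."""
    return 1.0 - np.count_nonzero(mask) / mask.size

# Repeat pruning until a target graph sparsity s_g is reached;
# p_g is the per-round pruning ratio.
m_g = np.random.default_rng(0).random(1000)  # toy edge-mask scores
while sparsity(m_g) < 0.25:                  # s_g = 25%
    m_g = prune_lowest(m_g, 0.05)            # p_g = 5% per round
```

The same loop applies to the weight mask m_θ with its own ratio p_θ and target s_θ; the two masks are pruned jointly in UGS.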
Specifically, we compare UGS with its two variants: (1) "Only weight," which does not conduct graph sparsification, and (2) "80% edges," which stops pruning edges as soon as the graph sparsity reaches 20%. Figure 1(a) shows the performance comparison of UGS and the two variants on the Cora dataset (Chen et al., 2021b). We can see that the accuracy of UGS (the red line) collapses when the graph sparsity becomes larger than 25%, while the two variants do not suffer such a significant performance degradation since neither of them sparsifies more than 20% of the edges. Clearly, the performance of GNNs is vulnerable to graph sparsification: removing certain edges tends to undermine the underlying structure of the graph, hampering message passing along edges. In this paper, we propose two techniques to improve GNN performance when the graph sparsity is high. The first technique is based on the observation that in UGS, only a fraction of the adjacency matrix elements (i.e., graph edges) are involved in loss calculation. As an illustration, consider the semi-supervised node classification task shown in Figure 1(b), where the nodes in yellow are labeled. Assuming a GNN with 2 graph convolution layers, only nodes within two hops of the yellow nodes are involved in the training process; these are highlighted in green. The gray dashed edges in Figure 1(b) are not involved in loss calculation: no message passing happens along them, so their corresponding mask elements receive zero gradients during backpropagation. On the Cora dataset, we find that around 50% of the edges are in this situation, leaving the values of their corresponding mask elements unchanged throughout the entire training process.
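The zero-gradient edges can be counted directly: with L message-passing layers, an edge's mask entry receives a gradient only if one of its endpoints lies within L − 1 hops of a labeled node, so that a message crossing the edge can still reach a labeled node's prediction. The sketch below is a hypothetical helper for this counting argument (assuming an undirected graph given as an edge list), not the measurement code used for Cora.

```python
from collections import deque

def edges_without_gradient(edges, labeled, num_hops):
    """Return the edges whose mask entries receive zero gradient for a
    GNN with `num_hops` message-passing layers: an edge contributes to
    the loss only if an endpoint is within num_hops - 1 hops of a
    labeled node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    # BFS outward from the labeled nodes, up to num_hops levels deep
    dist = {v: 0 for v in labeled}
    frontier = deque(labeled)
    while frontier:
        u = frontier.popleft()
        if dist[u] == num_hops:
            continue
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    def used(u, v):
        return min(dist.get(u, num_hops), dist.get(v, num_hops)) < num_hops
    return [(u, v) for u, v in edges if not used(u, v)]
```

For example, on a path graph 0-1-2-3-4 with only node 0 labeled and 2 layers, the edges (2, 3) and (3, 4) never receive gradients.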
After checking the source code of UGS, we find that it initializes these mask elements with additive random noise, so the ordering of their mask scores is entirely determined by this initial randomization rather than by the graph topology. As a result, the removal of low-scored edges tends to be random in later iterations of UGS, causing performance to collapse as some important "dashed" edges are removed. To address this problem, we add a new auxiliary loss head to better guide the edge pruning by involving the entire adjacency matrix. Specifically, this loss head uses a novel loss function that measures the inter-class separateness of nodes with the Wasserstein distance (WD). For each class, we calculate the WD between (1) the set of nodes that are predicted to be in the class and (2) the set of all other nodes. By maximizing the WD for every class, we maximize the difference between the extracted node features of different classes. Since this loss function involves all nodes, every element in the graph mask m_g receives a gradient during backpropagation. Our second technique is based on adversarial perturbation, which is widely used to improve the robustness of deep neural networks (Wong et al., 2020). To improve the robustness of graph lottery tickets (i.e., the pruned GNN sub-networks) when graph sparsity is high, we regard unfavorable graph sparsification as an adversarial data perturbation and formulate the pruning process as a min-max optimization problem. Specifically, a minimizer seeks to update both the weight parameters Θ and their mask m_θ against a maximizer that aims to perturb the graph mask m_g. By performing projected gradient ascent on the graph mask, we essentially use adversarial perturbations to significantly improve the robustness of our graph lottery tickets at high graph sparsity. We further investigate the question: Can we use the obtained winning ticket of a GNN for graph transfer learning?
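One alternating round of the min-max game can be sketched as projected gradient ascent on the graph mask (the maximizer) followed by gradient descent on the weights (the minimizer). This is a toy numerical sketch under stated assumptions: `grad_loss` is a hypothetical callable standing in for backpropagation through the sparsified GNN, the mask is projected onto the box [0, 1], and the weight mask m_θ update is folded into θ for brevity.

```python
import numpy as np

def minmax_step(m_g, theta, grad_loss, lr_max=0.1, lr_min=0.1):
    """One alternating round of the min-max optimization.

    grad_loss(m_g, theta) -> (dL/dm_g, dL/dtheta); here a placeholder
    for backpropagation through the sparsified GNN.
    """
    # Maximizer: projected gradient ascent perturbs the graph mask.
    g_mask, _ = grad_loss(m_g, theta)
    m_g = np.clip(m_g + lr_max * g_mask, 0.0, 1.0)
    # Minimizer: gradient descent updates the weights against the
    # perturbed mask.
    _, g_theta = grad_loss(m_g, theta)
    theta = theta - lr_min * g_theta
    return m_g, theta
```

With a toy bilinear loss L(m, θ) = m·θ (so dL/dm = θ and dL/dθ = m), the mask moves up along θ and is clipped at 1, while the weights move down along the perturbed mask.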
Studying this problem is particularly interesting since the "retrainability" of a winning ticket (i.e., a pruned sub-network together with the original weight initialization) on the same task is the most distinctive property of the Lottery Ticket Hypothesis (LTH): many works (Liu



Figure 1: UGS Analysis

