JOINT EDGE-MODEL SPARSE LEARNING IS PROVABLY EFFICIENT FOR GRAPH NEURAL NETWORKS

Abstract

Due to the significant computational challenge of training large-scale graph neural networks (GNNs), various sparse learning techniques have been exploited to reduce memory and storage costs. Examples include graph sparsification, which samples a subgraph to reduce the amount of data aggregation, and model sparsification, which prunes the neural network to reduce the number of trainable weights. Despite the empirical successes in reducing the training cost while maintaining the test accuracy, the theoretical generalization analysis of sparse learning for GNNs remains elusive. To the best of our knowledge, this paper provides the first theoretical characterization of joint edge-model sparse learning from the perspective of sample complexity and convergence rate in achieving zero generalization error. It proves analytically that both sampling important nodes and pruning the lowest-magnitude neurons can reduce the sample complexity and improve convergence without compromising the test accuracy. Although the analysis is centered on two-layer GNNs with structural constraints on the data, the insights apply to more general setups and are justified on both synthetic and practical citation datasets.

1. INTRODUCTION

Graph neural networks (GNNs) can represent graph-structured data effectively and find applications in object detection (Shi & Rajkumar, 2020; Yan et al., 2018), recommendation systems (Ying et al., 2018; Zheng et al., 2021), relational learning (Schlichtkrull et al., 2018), and machine translation (Wu et al., 2020; 2016). However, training GNNs directly on large-scale graphs such as scientific citation networks (Hull & King, 1987; Hamilton et al., 2017; Xu et al., 2018), social networks (Kipf & Welling, 2017; Sandryhaila & Moura, 2014; Jackson, 2010), and symbolic networks (Riegel et al., 2020) becomes computationally challenging or even infeasible, owing to both the exponential aggregation of neighboring features and the excessive model complexity. For example, training a two-layer GNN on the Reddit dataset (Tailor et al., 2020), which contains 232,965 nodes with an average degree of 492, can cost twice as much computation as training ResNet-50 on ImageNet (Canziani et al., 2016). The approaches to accelerating GNN training fall into two paradigms: (i) sparsifying the graph topology (Hamilton et al., 2017; Chen et al., 2018; Perozzi et al., 2014; Zou et al., 2019), and (ii) sparsifying the network model (Chen et al., 2021b; You et al., 2022). Sparsifying the graph topology means selecting a subgraph instead of the original graph to reduce the computation of neighborhood aggregation. One can either use a fixed subgraph (e.g., one that preserves the graph topology (Hübler et al., 2008), the graph shift operator (Adhikari et al., 2017; Chakeri et al., 2016), or the degree distribution (Leskovec & Faloutsos, 2006; Voudigari et al., 2016; Eden et al., 2018)) or apply sampling algorithms, such as edge sparsification (Hamilton et al., 2017) or node sparsification (Chen et al., 2018; Zou et al., 2019), to select a different subgraph in each iteration.
Sparsifying the network model means reducing the complexity of the neural network, including removing the non-linear activation (Wu et al., 2019; He et al., 2020), quantizing the neuron weights (Tailor et al., 2020; Bahri et al., 2021) and the outputs of intermediate layers (Liu et al., 2021), network pruning (Frankle & Carbin, 2019), and knowledge distillation (Yang et al., 2020; Hinton et al., 2015; Yao et al., 2020; Jaiswal et al., 2021). The two sparsification frameworks can be combined, such as the joint edge sampling and network model pruning in (Chen et al., 2021b; You et al., 2022). Despite many empirical successes in accelerating GNN training without sacrificing test accuracy, the theoretical evaluation of training GNNs with sparsification techniques remains largely unexplored. Most theoretical analyses center on the expressive power of sampled graphs (Hamilton et al., 2017; Cong et al., 2021; Chen et al., 2018; Zou et al., 2019; Rong et al., 2019) or pruned networks (Malach et al., 2020; Zhang et al., 2021; da Cunha et al., 2022). However, there is limited generalization analysis, i.e., of whether the learned model performs well on testing data. Most existing generalization analyses are limited to two-layer cases, even for the simplest feed-forward neural networks (NNs); see, e.g., (Zhang et al., 2020a; Oymak & Soltanolkotabi, 2020; Huang et al., 2021; Shi et al., 2022). To the best of our knowledge, only Li et al. (2022) and Allen-Zhu et al. (2019a) go beyond two layers, by considering three-layer GNNs and NNs, respectively. However, Li et al. (2022) requires a strong assumption, which cannot be justified empirically or theoretically, that the sampled graph indeed preserves the mapping from data to labels. Moreover, Li et al. (2022) and Allen-Zhu et al. (2019a) focus on a linearized model around the initialization, and the learned weights only stay near the initialization (Allen-Zhu & Li, 2022).
The linearized model cannot justify the advantages of using multi-layer (G)NNs and network pruning. As far as we know, there is no finite-sample generalization analysis for the joint sparsification, even for two-layer GNNs.

Contributions. This paper provides the first theoretical generalization analysis of joint topology-model sparsification in training GNNs, including (1) explicit bounds on the required number of known labels, referred to as the sample complexity, and on the convergence rate of stochastic gradient descent (SGD) in returning a model that predicts the unknown labels accurately; and (2) a quantitative proof that joint topology and model sparsification is a win-win strategy for the learning performance from the sample complexity and convergence rate perspectives. We consider the following problem setup: node classification with a one-hidden-layer GNN, assuming that some node features are class-relevant (Shi et al., 2022) and determine the labels, while the other node features are class-irrelevant and contain no information for labeling; the label of a node is affected by the class-relevant features of its neighbors. This structural constraint on the data characterizes the phenomenon that some nodes are more influential than others, as in social networks (Chen et al., 2018; Veličković et al., 2018), or the case where the graph contains redundant information (Zheng et al., 2020). Specifically, the sample complexity is quadratic in (1 − β)/α, where α ∈ (0, 1] is the probability of sampling nodes with class-relevant features, and a larger α means class-relevant features are sampled more frequently; β ∈ [0, 1) is the fraction of neurons pruned from the network model using a magnitude-based pruning method such as (Frankle & Carbin, 2019). The number of SGD iterations to reach a desirable model is linear in (1 − β)/α.
Therefore, our results formally prove that graph sampling reduces both the sample complexity and the number of iterations, and more significantly so when nodes with class-relevant features are sampled more frequently. The intuition is that importance sampling helps the algorithm learn the class-relevant features more efficiently and thus reduces the sample requirement and convergence time. The same improvement is observed when the pruning rate increases, as long as β does not exceed a threshold close to 1.

Consider an undirected graph G(V, E), where V is the set of nodes and E is the set of edges. Let R denote the maximum node degree. For any node v ∈ V, let x_v ∈ R^d and y_v ∈ {+1, −1} denote its input feature and corresponding label, respectively. Given all node features {x_v}_{v∈V} and partially known labels {y_v}_{v∈D} for nodes in D ⊂ V, the semi-supervised node classification problem aims to predict all unknown labels y_v for v ∈ V \ D.

2. GRAPH NEURAL NETWORKS: FORMULATION AND ALGORITHM

This paper considers a graph neural network with non-linear aggregator functions, as shown in Figure 1. The weights of the K neurons in the hidden layer are denoted by {w_k ∈ R^d}_{k=1}^K, and the weights in the linear layer are denoted by {b_k ∈ R}_{k=1}^K. Let W ∈ R^{d×K} and b ∈ R^K be the concatenations of {w_k}_{k=1}^K and {b_k}_{k=1}^K, respectively. For any node v, let N(v) denote the set of its (1-hop) neighbors (with self-connection), and let X_N(v) ∈ R^{d×|N(v)|} contain the features in N(v). The output of the GNN for node v can then be written as

g(W, b; X_N(v)) = (1/K) Σ_{k=1}^{K} b_k · AGG(X_N(v), w_k),   (1)

where AGG(X_N(v), w) denotes a general aggregator over the features X_N(v) with weight w, e.g., a weighted sum of neighbor nodes (Veličković et al., 2018), max-pooling (Hamilton et al., 2017), or min-pooling (Corso et al., 2020), combined with a non-linear activation function. We consider the ReLU activation ϕ(·) = max{·, 0}. Given the GNN, the label y_v at node v is predicted by sign(g(W, b; X_N(v))). We only update W due to the homogeneity of the ReLU function, which is a common practice to simplify the analysis (Allen-Zhu et al., 2019a; Arora et al., 2019; Oymak & Soltanolkotabi, 2020; Huang et al., 2021). The training problem minimizes the following empirical risk function (ERF):

min_W : f_D(W, b^(0)) := −(1/|D|) Σ_{v∈D} y_v · g(W, b^(0); X_N(v)).   (2)

The test error is evaluated by the following generalization error function:

I(g(W, b)) = (1/|V|) Σ_{v∈V} max{1 − y_v · g(W, b; X_N(v)), 0}.   (3)

If I(g(W, b)) = 0, then y_v = sign(g(W, b; X_N(v))) for all v, indicating zero test error. Albeit different from practical GNNs, the model considered in this paper can be viewed as a one-hidden-layer GNN, which is the state-of-the-art practice in generalization and convergence analyses with structural data (Brutzkus & Globerson, 2021; Damian et al., 2022; Shi et al., 2022; Allen-Zhu & Li, 2022).
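As a concrete illustration, the forward pass (1) with a max-pooling ReLU aggregator and the generalization error (3) can be sketched in a few lines. This is a minimal sketch under our own conventions (neighbor features stored as rows rather than the paper's d × |N(v)| columns; all function names are ours), not the authors' code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def aggregate_max(X_nbr, w):
    # Max-pooling aggregator: max over neighbors n of ReLU(<w, x_n>).
    # X_nbr has one neighbor feature per ROW (an illustrative convention).
    return relu(X_nbr @ w).max()

def gnn_output(W, b, X_nbr):
    # g(W, b; X_N(v)) = (1/K) * sum_k b_k * AGG(X_N(v), w_k), as in (1)
    K = W.shape[1]
    return sum(b[k] * aggregate_max(X_nbr, W[:, k]) for k in range(K)) / K

def generalization_error(W, b, neighborhoods, labels):
    # I(g) = (1/|V|) * sum_v max{1 - y_v * g(W, b; X_N(v)), 0}, as in (3)
    vals = [max(1.0 - y * gnn_output(W, b, Xn), 0.0)
            for Xn, y in zip(neighborhoods, labels)]
    return sum(vals) / len(vals)
```

For instance, with K = 2 neurons, b = (+1, −1), and a single neighbor equal to a class-relevant pattern, the hinge term vanishes exactly when y_v · g ≥ 1.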
Moreover, the optimization problem in (2) is already highly non-convex due to the non-linearity of the ReLU functions. For example, as shown in (Liang et al., 2018; Safran & Shamir, 2018), one-hidden-layer (G)NNs contain intractably many spurious local minima. In addition, the VC dimension of the GNN model and data distribution considered in this paper is proved to be at least exponential in the data dimension (see Appendix G for the proof). The model is thus highly expressive, and obtaining a polynomial sample complexity is extremely nontrivial, which is one contribution of this paper.

2.2. GNN LEARNING ALGORITHM VIA JOINT EDGE AND MODEL SPARSIFICATION

The GNN learning problem (2) is solved via a mini-batch SGD algorithm, summarized in Algorithm 1. The coefficients b_k are randomly selected from {+1, −1} and remain unchanged during training. The hidden-layer weights w_k are initialized from a multivariate Gaussian N(0, δ²I_d) with a small constant δ, e.g., δ = 0.1. The training data are divided into disjoint subsets, and one subset is used in each iteration to update W through SGD. Algorithm 1 contains two training stages: pre-training on W with few iterations, and re-training on the pruned model M ⊙ W, where ⊙ stands for entry-wise multiplication. Neuron-wise magnitude pruning is used to obtain the weight mask M, and graph topology sparsification is achieved by node sampling. During each iteration, only a subset of the neighbors, denoted N_s^(t)(v) for node v at iteration t, rather than the entire N(v), is fed into the aggregator in (1), so as to reduce the per-iteration computational complexity. This paper follows the GraphSAGE framework (Hamilton et al., 2017), where r (r ≪ R) neighbors are sampled for each node at each iteration (all neighbors are sampled if |N(v)| ≤ r). At iteration t, the gradient is

∇f_D^(t)(W, b^(0)) = −(1/|D|) Σ_{v∈D} y_v · ∇_W g(W, b^(0); X_{N_s^(t)(v)}),
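The GraphSAGE-style neighbor sampling step can be sketched as follows. This is a minimal uniform sampler (an importance sampler would replace the uniform draw with a weighted one); the function name and signature are our own illustrative choices:

```python
import random

def sample_neighbors(neighbors, r, rng=random):
    """Edge sparsification: keep at most r neighbors of a node.

    If |N(v)| <= r, all neighbors are kept; otherwise r neighbors are
    drawn uniformly without replacement, as in GraphSAGE.
    """
    if len(neighbors) <= r:
        return list(neighbors)
    return rng.sample(list(neighbors), r)
```

In an importance-sampling variant, nodes carrying class-relevant features would be drawn with probability at least α, which is the quantity the analysis tracks.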

Algorithm 1 Training GNN via Joint Edge and Model Sparsification

Input: node features X, known node labels {y_v}_{v∈D} with D ⊆ V, step size c_η = 10δ with constant δ, number of sampled edges r, pruning rate β, number of pre-training iterations T′ = ∥X∥_∞/c_η, number of re-training iterations T.
Initialization: W^(0), b^(0) with w_k^(0) ∼ N(0, δ²I_d) and b_k^(0) ∼ Uniform({−1, +1}) for k ∈ [K].
Pre-training: update the model weights W through mini-batch SGD with edge sampling:
1: Divide D into disjoint subsets {D^(t′)}_{t′=1}^{T′};
2: for t′ = 0, 1, 2, ..., T′ − 1 do
3:   Sample N_s^(t′)(v) for every node v in D^(t′);
4:   W^(t′+1) = W^(t′) − c_η · ∇_W f_{D^(t′)}(W^(t′), b^(0));
5: end for
Pruning: set the β fraction of neurons in W^(T′) with the lowest-magnitude weights to 0, and obtain the corresponding binary mask M;
Re-training: rewind the weights to the original initialization as M ⊙ W^(0), and update the model weights through SGD with edge sampling:
6: Divide D into disjoint subsets {D^(t)}_{t=1}^{T};
7: for t = 0, 1, 2, ..., T − 1 do
8:   Sample N_s^(t)(v) for every node v in D^(t);
9:   W^(t+1) = W^(t) − c_η · ∇_W f_{D^(t)}(M ⊙ W^(t), b^(0));
10: end for
Return: W^(T) and b^(0).

Here D ⊆ V is the subset of training nodes with known labels. The aggregator function used in this paper is the max-pooling function, i.e.,

AGG(X_{N_s^(t)(v)}, w) = max_{n ∈ N_s^(t)(v)} ϕ(⟨w, x_n⟩),

which has been widely used in GraphSAGE (Hamilton et al., 2017) and its variants (Guo et al., 2021; Oh et al., 2019; Zhang et al., 2022b; Lo et al., 2022). This paper considers an importance sampling strategy in which some nodes are sampled with a higher probability than others, in the spirit of the sampling strategies in (Chen et al., 2018; Zou et al., 2019; Chen et al., 2021b). Model sparsification first pre-trains the neural network (often for only a few iterations) and then prunes the network by setting some neuron weights to zero. It then re-trains the pruned model, which has fewer parameters and is less computationally expensive to train. Existing pruning methods include neuron pruning and weight pruning: the former sets all entries of a neuron w_k to zero simultaneously, while the latter sets entries of w_k to zero independently. This paper considers neuron pruning. Similar to (Chen et al., 2021b), we first train the original GNN until the algorithm converges. Then, magnitude pruning is applied by removing a fraction β (β ∈ [0, 1)) of the neurons with the smallest norms. Let M ∈ {0, 1}^{d×K} be the binary mask matrix with all zeros in column k if neuron k is removed. Then, we rewind the remaining GNN to the original initialization (i.e., M ⊙ W^(0)) and re-train the model M ⊙ W.
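The neuron-wise magnitude pruning and rewinding steps can be sketched as follows. This is a simplified illustration under our own naming (the pre-training and re-training loops are omitted):

```python
import numpy as np

def neuron_magnitude_mask(W_pretrained, beta):
    """Neuron-wise magnitude pruning: zero out the beta fraction of
    columns (neurons) of W with the smallest l2 norms, returning the
    binary mask M in {0,1}^{d x K}."""
    d, K = W_pretrained.shape
    norms = np.linalg.norm(W_pretrained, axis=0)   # per-neuron l2 norm
    n_prune = int(beta * K)
    mask = np.ones((d, K))
    if n_prune > 0:
        pruned = np.argsort(norms)[:n_prune]       # smallest-norm neurons
        mask[:, pruned] = 0.0
    return mask

def rewind(mask, W_init):
    # Re-training starts from M (entry-wise *) the original initialization.
    return mask * W_init
```

Re-training then runs SGD on the masked weights only, which is what makes the pruned model cheaper to train.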

3.1. TAKEAWAYS OF THE THEORETICAL FINDINGS

Before formally presenting our data model and theoretical results, we briefly summarize the key takeaways. We consider a general setup in which the node features are noisy realizations of a set of class-relevant features and class-irrelevant ones, and σ denotes the upper bound on the additive noise. The label y_v of node v is determined by the class-relevant features in N(v). For simplicity, we assume that N(v) contains the class-relevant features of exactly one class. Some major parameters are summarized in Table 1. The highlights include:

(T1) Sample complexity and convergence analysis for zero generalization error. We prove that the learned model (with or without any sparsification) achieves zero generalization error with high probability over the randomness of the initialization and the SGD steps. The sample complexity is linear in σ² and K⁻¹, and the number of iterations is linear in σ and K^(−1/2). Thus, the learning performance improves, in terms of smaller sample complexity and faster convergence, if the neural network is slightly over-parameterized.

(T2) Edge sparsification and importance sampling improve the learning performance. The sample complexity is a quadratic function of r, indicating that edge sparsification reduces the sample complexity. The intuition is that edge sparsification reduces the aggregation of class-irrelevant features with class-relevant ones, making the class-relevant patterns easier to learn under the considered data model. The sample complexity and the number of iterations are quadratic and linear in α⁻¹, respectively. As a larger α means the class-relevant features are sampled with a higher probability, this result is consistent with the intuition that a successful importance sampling strategy helps to learn the class-relevant features faster with fewer samples.

(T3) Magnitude-based model pruning improves the learning performance.
Both the sample complexity and the computational time are linear in (1 − β), indicating that pruning more of the small-magnitude neurons reduces both. The intuition is that neurons that accurately learn class-relevant features tend to have larger magnitudes than other neurons, and removing the other neurons makes learning more efficient.

(T4) Edge and model sparsification is a win-win strategy in GNN learning. Our theorem provides a theoretical validation for the success of joint edge-model sparsification. The sample complexity and the number of iterations are quadratic and linear in (1 − β)/α, respectively, indicating that the two techniques can be applied together to effectively enhance the learning performance.
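As a sanity check of takeaways (T2)-(T4), the dominant (1 − β)/α dependence can be encoded directly. These are illustrative helper functions of our own, capturing only the stated scaling, not the full bounds:

```python
def sample_complexity_scale(alpha, beta):
    # Sample complexity is quadratic in (1 - beta)/alpha (T4):
    # it shrinks as sampling improves (alpha up) or pruning increases (beta up).
    return ((1.0 - beta) / alpha) ** 2

def iteration_scale(alpha, beta):
    # Number of SGD iterations is linear in (1 - beta)/alpha (T4).
    return (1.0 - beta) / alpha
```

For example, doubling α while pruning half the neurons reduces the sample-complexity scaling by a factor of 16 and the iteration scaling by a factor of 4 under this dependence.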

3.2. FORMAL THEORETICAL RESULTS

Data model. Let P = {p_i}_{i=1}^L (L ≤ d) denote an arbitrary set of orthogonal vectors in R^d. Let p_+ := p_1 and p_− := p_2 be the positive-class and negative-class patterns, respectively, which bear the causal relation with the labels. The remaining vectors in P are class-irrelevant patterns. The feature x_v of every node v is a noisy version of one of these patterns, i.e., x_v = p_v + z_v, where p_v ∈ P and z_v is an arbitrary noise vector at node v with ∥z_v∥_2 ≤ σ for some σ. The label y_v is +1 (or −1) if node v or any of its neighbors contains p_+ (or p_−). Specifically, divide V into four disjoint sets V_+, V_−, V_{N+}, and V_{N−} based on whether the node feature is class-relevant or not (N in the subscript) and on the label. [Figure: illustration of the partition of V into V_+, V_{N+}, V_−, and V_{N−}.] We assume p_v = p_+ for all v ∈ V_+; p_v = p_− for all v ∈ V_−; and p_v ∈ {p_3, ..., p_L} for all v ∈ V_{N+} ∪ V_{N−}. Then, y_v = 1 for all v ∈ V_+ ∪ V_{N+}, and y_v = −1 for all v ∈ V_− ∪ V_{N−}.

(A1) Every v in V_{N+} (or V_{N−}) is connected to at least one node in V_+ (or V_−). There is no edge between V_+ and V_{N−}, and no edge between V_− and V_{N+}.

(A2) The positive and negative labels in D are balanced, i.e., |D ∩ (V_+ ∪ V_{N+})| − |D ∩ (V_− ∪ V_{N−})| = O(√|D|).

(A1) indicates that connected nodes tend to have the same labels and rules out the case that a node is connected to both p_+ and p_−, which simplifies the analysis. A numerical justification of this assumption on the Cora dataset can be found in Appendix F.2. (A2) can be relaxed to the case of unbalanced observed labels: one only needs to up-weight the minority class in the ERF (2) accordingly, which is a common trick in imbalanced GNN learning (Chen et al., 2021a), and our analysis holds with minor modification. The data model of orthogonal patterns was introduced in (Brutzkus & Globerson, 2021) to analyze the advantage of CNNs over fully connected neural networks.
It simplifies the analysis by eliminating the interaction between class-relevant and class-irrelevant patterns. Here we generalize to the case where the node features contain additive noise and are no longer orthogonal. To analyze the impact of importance sampling quantitatively, let α denote a lower bound on the probability that the sampled neighbors of v contain at least one node in V_+ or V_− for any node v (see Table 1). Clearly, α = r/R is a lower bound for uniform sampling. A larger α indicates that the sampling strategy indeed selects nodes with class-relevant features more frequently. Theorem 1 gives the sample complexity (C1) and convergence rate (C2) of Algorithm 1 in learning graph-structured data via graph sparsification. Specifically, the returned model achieves zero generalization error (from (8)) with enough samples (C1) after enough iterations (C2).

Theorem 1. Let the step size c_η be some positive constant and the pruning rate β ∈ [0, 1 − 1/L). Suppose the noise is bounded such that σ < 1/L and the model is sufficiently large such that K > L² · log q for some constant q > 0. Then, with probability at least 1 − q^(−10), when

(C1) the number of labeled nodes satisfies |D| = Ω((1 + L²σ² + K⁻¹) · α⁻² · (1 + r²) · (1 − β)² · L² · log q), and

(C2) the number of iterations in the re-training stage satisfies T = Ω(c_η⁻¹ · (1 + |D|^(−1/2)) · (1 + Lσ + K^(−1/2)) · (1 − β) · α⁻¹ · L),

the model returned by Algorithm 1 achieves zero generalization error, i.e., I(g(W^(T), b^(0))) = 0.
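The feature-generation part of this data model can be instantiated in a short sketch, used in the spirit of the synthetic experiments of Section 4.1. Edges satisfying (A1) are omitted for brevity, Gaussian noise is used (as in Section 4.1) rather than the bounded noise of the analysis, and all names are our own assumptions:

```python
import numpy as np

def make_node_features(d, L, sigma, sizes, rng):
    """Generate node features under the data model: orthonormal patterns
    p_1, ..., p_L with p_1 = p+ and p_2 = p-; each node feature is its
    assigned pattern plus Gaussian noise.
    `sizes` = (|V+|, |V-|, |V_N+|, |V_N-|)."""
    P = np.eye(d)[:L]                         # rows are the L patterns
    n_pos, n_neg, n_np, n_nn = sizes
    # V+ nodes carry p_1, V- nodes carry p_2; class-irrelevant nodes
    # carry a random pattern from {p_3, ..., p_L}
    assignments = ([0] * n_pos + [1] * n_neg +
                   list(rng.integers(2, L, n_np + n_nn)))
    labels = np.array([1] * n_pos + [-1] * n_neg + [1] * n_np + [-1] * n_nn)
    feats = np.array([P[i] + rng.normal(0.0, sigma / np.sqrt(d), d)
                      for i in assignments])
    return feats, labels
```

A full generator would additionally wire V_{N+} nodes to V_+ (and V_{N−} to V_−) so that (A1) holds.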

3.3. TECHNICAL CONTRIBUTIONS

Although our data model is inspired by the feature learning framework for analyzing CNNs (Brutzkus & Globerson, 2021), our technical framework for analyzing the learning dynamics differs from existing ones in the following aspects. First, our work provides the first polynomial-order sample complexity bound in (6) that quantitatively characterizes the dependence on the problem parameters in achieving zero generalization error. In (Brutzkus & Globerson, 2021), the generalization bound is obtained by updating the weights in the second (linear) layer while the weights in the non-linear layer are fixed. However, the high expressivity of neural networks mainly comes from the weights in the non-linear layers, and updating the hidden-layer weights can achieve a smaller generalization error than updating the output layer. Therefore, this paper obtains a polynomial-order sample complexity bound by characterizing the weight updates in the non-linear (hidden) layer, which cannot be derived from (Brutzkus & Globerson, 2021). In addition, updating the hidden-layer weights is a non-convex problem, which is more challenging than updating the output-layer weights, a convex problem. Second, the theoretical framework in this paper can characterize the magnitude-based pruning method while the approach in (Brutzkus & Globerson, 2021) cannot. Specifically, our analysis provides a tighter bound such that the lower bound for "lucky neurons" can be much larger than the upper bound for "unlucky neurons" (see Lemmas 2-5), which is the theoretical foundation for characterizing the benefits of model pruning but is not available in (Brutzkus & Globerson, 2021) (see their Lemmas 5.3 & 5.5). On the one hand, (Brutzkus & Globerson, 2021) only provides a uniform bound for "unlucky neurons" in all directions, while Lemma 4 in this paper provides specific bounds in different directions.
On the other hand, this paper considers the influence of the number of samples, and we need to characterize the gradient offsets between the positive and negative classes. The problem is challenging because class-irrelevant patterns and edge sampling break the dependence between the labels and the pattern distributions, leading to unexpected distribution shifts. We characterize groups of special data as references such that they maintain a fixed dependence on the labels and have a controllable distribution shift relative to the sampled data. Third, the theoretical framework in this paper can characterize edge sampling while the approach in (Brutzkus & Globerson, 2021) cannot. (Brutzkus & Globerson, 2021) requires the training samples to contain class-relevant patterns via a margin generalization bound (Shalev-Shwartz & Ben-David, 2014; Bartlett & Mendelson, 2002). However, with edge sampling, the sampled data may no longer contain class-relevant patterns. Therefore, updating the second layer is not robust, but our theoretical results show that updating the hidden layer is robust to the outliers caused by edge sampling.

3.4. THE PROOF SKETCH

Before presenting the formal roadmap of the proof, we give a high-level illustration by borrowing the concept of a "lucky neuron", i.e., a neuron with good initial weights, from (Brutzkus & Globerson, 2021). We emphasize that only the concept is borrowed; all the properties of the "lucky neurons", e.g., (10) to (13), are developed independently of the theoretical findings in other papers. In this paper, we show that the magnitude of the "lucky neurons" grows at a rate proportional to the sampling probability of class-relevant features, while the magnitude of the "unlucky neurons" is upper bounded in terms of the inverse of the training set size (see Proposition 2). With enough training data, the "lucky neurons" have large magnitudes and dominate the output value. By pruning neurons with small magnitudes, we preserve the "lucky neurons" and potentially remove "unlucky neurons" (see Proposition 3). In addition, we prove that the primary direction of a "lucky neuron" is consistent with the corresponding class-relevant pattern, and that the fraction of "lucky neurons" is sufficiently large (see Proposition 1). Therefore, the output is determined by the primary directions of the "lucky neurons", i.e., the class-relevant patterns. Specifically, we will prove that, for every node v with y_v = 1, the prediction by the learned weights W^(T) is accurate, i.e., g(M ⊙ W^(T), b^(0); X_N(v)) > 1. The argument for nodes with negative labels is the same. Then, zero test error follows from the generalization error defined in (3). Divide the neurons into two subsets B_+ = {k | b_k^(0) = +1} and B_− = {k | b_k^(0) = −1}. We first show that there exist some neurons i in B_+ with weights w_i^(t) that stay close to p_+ for all iterations t ≥ 0. These neurons, referred to as "lucky neurons," play the dominant role in classifying v, and their fraction is at least close to 1/L. Formally,

Proposition 1.
Let K_+ ⊆ B_+ denote the set of "lucky neurons" such that, for any i ∈ K_+,

min_{∥z∥_2 ≤ σ} ⟨w_i^(t), p_+ + z⟩ ≥ max_{p ∈ P\{p_+}, ∥z∥_2 ≤ σ} ⟨w_i^(t), p + z⟩ for all t.

Then it holds that |K_+|/K ≥ (1 − K^(−1/2) − Lσ)/L.

We next show in Proposition 2 that when |D| is large enough, the projection of the weights w_i^(t) of a lucky neuron i onto p_+ grows at a rate proportional to c_η α; thus importance sampling with a large α corresponds to a high growth rate. In contrast, the neurons in B_− grow much more slowly in all directions except p_−.

Proposition 2.

⟨w_i^(t), p_+⟩ ≥ c_η (α − σ√((1 + r²)/|D|)) · t, ∀i ∈ K_+, ∀t,   (10)

|⟨w_j^(t), p⟩| ≤ c_η (1 + σ) · √((1 + r²)/|D|) · t, ∀j ∈ B_−, ∀p ∈ P\{p_−}, ∀t.   (11)

Proposition 3 shows that the weight magnitude of a "lucky neuron" in K_+ is larger than that of a neuron in B_+\K_+. Combined with (10), "lucky neurons" will not be pruned by magnitude pruning as long as β < 1 − 1/L. Let K_β denote the set of neurons after pruning, with |K_β| = (1 − β)K.

Proposition 3. There exists a small positive integer C such that

∥w_i^(t)∥_2 > ∥w_j^(t)∥_2, ∀i ∈ K_+, ∀j ∈ B_+\K_+, ∀t ≥ C.   (12)

Moreover, K_β ∩ K_+ = K_+ for all β ≤ 1 − 1/L.

Therefore, with a sufficiently large number of samples, the magnitudes of the lucky neurons increase much faster than those of other neurons (from Proposition 2). Given a sufficiently large fraction of lucky neurons (from Proposition 1), the outputs of the learned model are strictly positive. Moreover, with a proper pruning rate, the fraction of lucky neurons can be further improved (from Proposition 3), which leads to a reduced sample complexity and a faster convergence rate.
Published as a conference paper at ICLR 2023

In the end, we consider the case of no feature noise to illustrate the main computation:

g(M ⊙ W^(T), b^(0); X_N(v))
= (1/K) [ Σ_{i∈B_+∩K_β} max_{u∈N(v)} ϕ(⟨w_i^(T), x_u⟩) − Σ_{j∈B_−∩K_β} max_{u∈N(v)} ϕ(⟨w_j^(T), x_u⟩) ]
≥ (1/K) Σ_{i∈K_+} max_{u∈N(v)} ϕ(⟨w_i^(T), x_u⟩) − (1/K) Σ_{j∈B_−∩K_β} max_{p∈P\{p_−}} |⟨w_j^(T), p⟩|
≥ ( α|K_+|/K − (1 − β)(1 + σ)√((1 + r²)/|D|) ) · c_η T > 1,

where the first inequality follows from the fact that K_β ∩ K_+ = K_+, that ϕ is the nonnegative ReLU function, and that N(v) does not contain p_− for a node v with y_v = +1. The second inequality follows from Proposition 2. The last inequality follows from (10) and conclusions (C1) & (C2). This completes the proof. Please see the supplementary material for details.
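The final chain of inequalities can be sanity-checked numerically by plugging hypothetical parameter values (our own, purely illustrative) into the closing lower bound:

```python
import math

def output_lower_bound(alpha, lucky_frac, beta, sigma, r, n_samples, c_eta, T):
    """Lower bound on g(M * W^(T), b^(0); X_N(v)) from the proof sketch:
    (alpha * |K+|/K - (1 - beta)(1 + sigma) * sqrt((1 + r^2)/|D|)) * c_eta * T.
    Zero test error requires this to exceed 1."""
    noise_term = (1 - beta) * (1 + sigma) * math.sqrt((1 + r ** 2) / n_samples)
    return (alpha * lucky_frac - noise_term) * c_eta * T
```

With, e.g., α = 0.5, |K_+|/K = 0.25, β = 0.5, σ = 0.1, r = 5, |D| = 10^4, c_η = 0.1, and T = 200, the bound exceeds 1, and it grows as |D| grows, consistent with (C1) and (C2).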

4.1. SYNTHETIC DATA EXPERIMENTS

We generate a graph with 10,000 nodes, each with node degree 30. The one-hot vectors e_1 and e_2 are selected as p_+ and p_−, respectively. The class-irrelevant patterns are randomly selected from the null space of p_+ and p_−, which relaxes the orthogonality constraint of the data model. ∥p∥_2 is normalized to 1 for all patterns. The noise z_v is drawn from a Gaussian distribution N(0, σ²). The node features and labels satisfy (A1) and (A2), and details of the construction can be found in Appendix F. The test error is the percentage of incorrect predictions of unknown labels. The learning process is considered a success if the returned model achieves zero test error.

Sample Complexity. We first verify our sample complexity bound in (6). Every result is averaged over 100 independent trials. A white block indicates that all trials succeeded, while a black block means all failed. In these experiments, we vary one parameter and fix all others. In Figure 3, r = 15 and we vary the importance sampling probability α; the sample complexity is linear in α⁻². Figure 4 indicates that the sample complexity is almost linear in (1 − β)² up to a certain upper bound, where β is the pruning rate. All these observations are consistent with our theoretical predictions in (6).

Training Convergence Rate. Next, we evaluate how sampling and pruning reduce the required number of iterations to reach zero generalization error. Figure 5 shows the required number of iterations for different α under different noise levels σ. Each point is averaged over 1000 independent realizations, and the regions in low transparency denote error bars of one standard deviation. The number of iterations is linear in 1/α, which verifies our theoretical findings in (7); thus, importance sampling reduces the number of iterations needed for convergence. Figure 6 illustrates the required number of iterations for convergence with various pruning rates.
The baseline is the average number of iterations for training the dense network. The required number of iterations under magnitude pruning is almost linear in β, which verifies our theoretical findings in (7). In comparison, random pruning degrades the performance, requiring more iterations than the baseline to converge. Magnitude pruning removes neurons with irrelevant information. Figure 7 shows the distribution of neuron weights after the algorithm converges. There are 10⁴ points, collected from the neurons in 100 independent trials. The y-axis is the norm of the neuron weights w_k, and the x-axis is the angle between the neuron weights and p_+ (bottom) or p_− (top). The blue crosses represent w_k's with b_k = 1, and the red circles represent w_k's with b_k = −1. In both cases, a w_k with a small norm indeed has a large angle with p_+ (or p_−) and thus contains class-irrelevant information for classifying class +1 (or −1). Figure 7 verifies Proposition 3, showing that magnitude pruning removes neurons with class-irrelevant information.

Performance enhancement with joint edge-model sparsification. Figure 8 illustrates the learning success rate as the importance sampling probability α and the pruning rate β change. For each pair of α and β, the result is averaged over 100 independent trials. When either α or β increases, it becomes more likely to learn a desirable model. We evaluate the joint edge-model sparsification algorithm on real citation datasets (Cora, Citeseer, and Pubmed) (Sen et al., 2008) with the standard GCN (a two-message-passing GNN) (Kipf & Welling, 2017). The Unified GNN Sparsification (UGS) in (Chen et al., 2021b) is implemented as the edge sampling method, and the model pruning approach is magnitude-based pruning. Figure 9 shows the performance of node classification on the Cora dataset.
As we can see, joint sparsification helps reduce the sample complexity required to meet the test error of the original model. For example, P₂, with joint rates of sampled edges and pruned neurons of (0.90, 0.49), and P₃, with joint rates of (0.81, 0.60), return models with better test performance than the original model (P₁) trained on a larger dataset. By varying the training sample size, we observe the behavior predicted by our theory: the sample complexity decreases with joint sparsification. Figure 10 shows the test errors on the Citeseer dataset under different sparsification rates, where darker colors denote lower errors. In both figures, we observe that joint edge sampling and pruning can reduce the test error even when more than 90% of the neurons are pruned and 25% of the edges are removed, which justifies the efficiency of joint edge-model sparsification. In addition, joint model-edge sparsification with a smaller number of training samples can achieve similar or even better performance than training without sparsification. For instance, with 120 training samples, the test error is 30.4% without any sparsification, whereas joint sparsification improves the test error to 28.7% with only 96 training samples. We only include partial results due to the space limit; please see the supplementary materials for more experiments on synthetic and real datasets.

A RELATED WORKS

Generalization analysis of GNNs. Two recent papers (Du et al., 2019; Xu et al., 2021) exploit the neural tangent kernel (NTK) framework (Malach et al., 2020; Allen-Zhu et al., 2019b; Jacot et al., 2018; Du et al., 2018; Lee et al., 2018) for the generalization analysis of GNNs. It is shown in (Du et al., 2019) that the graph neural tangent kernel (GNTK) achieves a bounded generalization error only if the labels are generated by some special function, e.g., the function needs to be linear or even. (Xu et al., 2021) analyzes the generalization of deep linear GNNs with skip connections. The NTK approach considers the regime in which the model is sufficiently over-parameterized, i.e., the number of neurons is a polynomial function of the number of samples, such that the landscape of the risk function becomes almost convex near any initialization. The required model complexity is much larger than in practical cases, and the results are independent of the data distribution. As the neural network learning process is strongly correlated with the input structure (Shi et al., 2022), distribution-free analyses such as NTK might not accurately explain the learning performance on data with special structures. Following the model recovery frameworks (Zhong et al., 2017; Zhang et al., 2022a; 2020b), Zhang et al. (2020a) analyzes the generalization of one-hidden-layer GNNs assuming the features follow a Gaussian distribution, but the analysis requires a special tensor initialization method and does not explain the practical success of SGD with random initialization. Besides these, the generalization gap between the training and test errors is characterized through the classical Rademacher complexity in (Scarselli et al., 2018; Garg et al., 2020) and the uniform stability framework in (Verma & Zhang, 2019; Zhou & Wang, 2021). Generalization analysis with structural constraints on data.
Assuming the data come from mixtures of well-separated distributions, (Li & Liang, 2018) analyzes the generalization of one-hidden-layer fully connected neural networks. Recent works (Shi et al., 2022; Brutzkus & Globerson, 2021; Allen-Zhu & Li, 2022; Karp et al., 2021; Wen & Li, 2021; Li et al., 2023) analyze one-hidden-layer neural networks assuming the data can be divided into discriminative and background patterns. Neural networks with non-linear activation functions memorize the discriminative features and have guaranteed generalization on unseen data with the same structural constraints, while no linear classifier with random initialization can learn the data mapping with polynomial size and time (Shi et al., 2022; Daniely & Malach, 2020). Nevertheless, none of these works has considered GNNs or sparsification.

B OVERVIEW OF THE TECHNIQUES

Before presenting the proof details, we will provide a high-level overview of the proof techniques in this section. To warm up, we first summarize the proof sketch without edge and model sparsification methods. Then, we illustrate the major challenges in deriving the results for edge and model sparsification approaches.

B.1 GRAPH NEURAL NETWORK LEARNING ON DATA WITH STRUCTURAL CONSTRAINTS

For the convenience of presentation, we use D_+ and D_− to denote the sets of nodes with positive and negative labels in D, respectively, where D_+ = (V_+ ∪ V_{N+}) ∩ D and D_− = (V_− ∪ V_{N−}) ∩ D. Recall that p_+ only exists in the neighbors of nodes v ∈ D_+, and p_− only exists in the neighbors of nodes v ∈ D_−. In contrast, class-irrelevant patterns are distributed identically over D_+ and D_−. In addition, for some neurons, the gradient direction is always near p_+ for v ∈ D_+, while the gradient derived from v ∈ D_− is almost orthogonal to p_+. Such a neuron is the lucky neuron defined in Proposition 1 in Section 3.4, and we formally define the lucky neuron in Appendix C from another point of view. Take the neurons in B_+ = {k | b_k = +1} for instance.

Figure 11: Illustration of the iterations {w^(t)}_{t=1}^T: (a) a lucky neuron, and (b) an unlucky neuron. (The panels plot the gradients from D_+ and D_− in the directions of p_+ or p_− and of p ∈ P_N.)

Similar to the derivation of w_k^(t) for k ∈ B_+, we can show that the neurons in K_−, which are the lucky neurons with respect to k ∈ B_−, have their weights updated mainly in the direction of p_−. Recall that the output of the GNN model is written as (20); the corresponding coefficients in the linear layer for k ∈ B_+ are all positive. With these in hand, we know that the neurons in B_+ have a relatively large magnitude in the direction of p_+ compared with other patterns, and the corresponding coefficients b_k are positive. Then, for any node v ∈ V_+ ∪ V_{N+}, the calculated label is strictly positive. Similarly, the calculated label for any node v ∈ V_− ∪ V_{N−} is strictly negative. Edge sparsification.
Figure 11(b) shows that, without sampling, the gradient in the direction of any class-irrelevant pattern is a linear function of √(1 + R²). Sampling on edges can significantly reduce the degree of the graph, i.e., the degree is reduced to r when only r neighbor nodes are sampled. Therefore, the projection of the neuron weights on any p ∈ P_N becomes smaller, and the neurons are less likely to learn class-irrelevant features. In addition, the computational complexity per iteration is reduced since we only need to traverse a subset of the edges. Nevertheless, sampling graph edges may cause class-relevant features to be missed at some training nodes (a smaller α), which degrades the convergence rate and requires a larger number of iterations. Model sparsification. Comparing Figures 11(a) and 11(b), we can see that the magnitude of a lucky neuron grows much faster than that of an unlucky neuron. In addition, from Lemma 6, we know that a lucky neuron at initialization remains a lucky neuron in the following iterations. Therefore, magnitude-based pruning of the original dense model removes unlucky neurons but preserves the lucky neurons. When the fraction of lucky neurons is increased, the neurons learn the class-relevant features faster. Also, the algorithm can tolerate a larger gradient noise derived from the class-irrelevant patterns in the inputs, which is of order 1/√|D| from Figure 11(b). Therefore, the number of samples required for convergence can be significantly reduced. Noise factor z. The noise factor z degrades the generalization mainly in the following aspects. First, in Brutzkus & Globerson (2021), the sample complexity depends on the number of distinct values in {x_v}_{v∈D}, which can be as large as |D| when there is noise. Second, the fraction of lucky neurons is reduced as a function of the noise level. With a smaller fraction of lucky neurons, we require a larger number of training samples and iterations for convergence.
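The magnitude-based model sparsification discussed above can be sketched in a few lines: rank the neurons by weight norm and mask out the smallest β fraction. This is a minimal illustration with our own function and variable names, not the paper's exact Algorithm 1.

```python
import numpy as np

def magnitude_prune(W, beta):
    """Keep the (1 - beta) fraction of neurons (rows of W) with the
    largest weight norms; return a boolean neuron mask and the pruned W."""
    K = W.shape[0]
    keep = int(round((1 - beta) * K))          # number of neurons to keep
    norms = np.linalg.norm(W, axis=1)          # magnitude of each neuron
    kept_idx = np.argsort(norms)[::-1][:keep]  # indices of largest-norm neurons
    mask = np.zeros(K, dtype=bool)
    mask[kept_idx] = True
    return mask, W * mask[:, None]

# Toy check: large-magnitude ("lucky") neurons survive pruning.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4)) * 0.1
W[:3] *= 50.0                                  # three large-magnitude neurons
mask, W_pruned = magnitude_prune(W, beta=0.5)
```

With β = 0.5 and K = 10 neurons, five rows survive, including the three large ones, mirroring the claim that pruning keeps lucky (large-magnitude) neurons.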

C NOTATIONS

In this section, we present the details of the data model and problem formulation described in Section 3.2, and we define some important notations to simplify the presentation of the proofs. All the notations used in the following proofs are summarized in Tables 2 and 3.

C.1 DATA MODEL WITH STRUCTURAL CONSTRAINTS

Recalling the definitions in Section 3.2, the node feature of node v is written as x_v = p_v + z_v, where p_v ∈ P and z_v is bounded noise with ∥z_v∥_2 ≤ σ. In addition, there are L orthogonal patterns in P, denoted as {p_ℓ}_{ℓ=1}^L: p_+ := p_1 is the positive class-relevant pattern, p_− := p_2 is the negative class-relevant pattern, and the remaining patterns, denoted as P_N, are the class-irrelevant patterns. For node v, its label y_v is positive or negative if its neighbors contain p_+ or p_−, respectively. By saying a node v contains a class-relevant feature, we mean that p_v = p_+ or p_−. Depending on x_v and y_v, we divide the nodes in V into four disjoint partitions, i.e., V = V_+ ∪ V_− ∪ V_{N+} ∪ V_{N−}, where

V_+ := {v | p_v = p_+}; V_− := {v | p_v = p_−}; V_{N+} := {v | p_v ∈ P_N, y_v = +1}; V_{N−} := {v | p_v ∈ P_N, y_v = −1}. (15)

Then, we consider the model such that (i) the distributions of p_+ and p_− are identical, namely,

Prob(p_v = p_+) = Prob(p_v = p_−), (16)

and (ii) the patterns p ∈ P_N are identically distributed in V_{N+} and V_{N−}, namely,

Prob(p | v ∈ V_{N+}) = Prob(p | v ∈ V_{N−}) for any p ∈ P_N. (17)

It is easy to verify that, when (16) and (17) hold, the numbers of positive and negative labels in D are balanced, such that

Prob(y_v = +1) = Prob(y_v = −1) = 1/2. (18)

If D_+ and D_− are highly unbalanced, namely, ||D_+| − |D_−|| is comparable to |D|, the objective function in (2) can be modified as

f_D := −(1/(2|D_+|)) Σ_{v∈D_+} y_v · g(W; X_{N(v)}) − (1/(2|D_−|)) Σ_{v∈D_−} y_v · g(W; X_{N(v)}),

and the required number of samples |D| in (6) is replaced with min{|D_+|, |D_−|}.
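The four-way node partition in (15) can be sketched as follows; this is a small illustration under our own names (`partition_nodes`), where `patterns` maps each node to its noiseless pattern p_v and `labels` to y_v.

```python
import numpy as np

def partition_nodes(patterns, labels, p_plus, p_minus):
    """Split node indices into (V+, V-, VN+, VN-) as in (15): V+/V- hold
    nodes whose own pattern is class relevant; the remaining nodes are
    split by their label into VN+ and VN-."""
    V_plus, V_minus, VN_plus, VN_minus = [], [], [], []
    for v, p in enumerate(patterns):
        if np.array_equal(p, p_plus):
            V_plus.append(v)
        elif np.array_equal(p, p_minus):
            V_minus.append(v)
        elif labels[v] == +1:
            VN_plus.append(v)
        else:
            VN_minus.append(v)
    return V_plus, V_minus, VN_plus, VN_minus
```

By construction the four lists are disjoint and cover all of V, matching V = V_+ ∪ V_− ∪ V_{N+} ∪ V_{N−}.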

C.2 GRAPH NEURAL NETWORK MODEL

It is easy to verify that (1) is equivalent to the model

g(W, U; x) = (1/K) Σ_k AGG(X_{N(v)}, w_k) − (1/K) Σ_k AGG(X_{N(v)}, u_k), (20)

where the neuron weights {w_k}_{k=1}^K in (20) correspond to the neurons with b_k = +1 in (1), and the neuron weights {u_k}_{k=1}^K in (20) correspond to the neurons with b_k = −1 in (1). Here, we abuse K to represent the number of neurons in {k | b_k = +1} or {k | b_k = −1}, which differs from the K in (1) by a factor of 2. Since this paper aims at providing order-wise analysis, the bounds for K in (1) and (20) are the same. Corresponding to the model in (20), we denote M_+ as the mask matrix after pruning with respect to W and M_− as the mask matrix after pruning with respect to U. For the convenience of analysis, we consider balanced pruning in W and U, i.e., ∥M_+∥_0 = ∥M_−∥_0 = (1 − β)Kd.
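The two-branch model in (20) with the max-pooling aggregator can be written out concretely. The sketch below is a minimal numpy forward pass under assumed shapes (W, U are K × d weight matrices, X is the node-feature matrix, `neighbors[v]` lists N(v) including v); the function names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def agg(X_nbr, w):
    """AGG(X_N(v), w) = max_{n in N(v)} phi(<w, x_n>), the max-pooling
    aggregator with ReLU activation phi."""
    return relu(X_nbr @ w).max()

def gnn_output(W, U, X, neighbors, v):
    """Two-branch form of the model in (20):
    g = (1/K) sum_k AGG(X_N(v), w_k) - (1/K) sum_k AGG(X_N(v), u_k)."""
    X_nbr = X[neighbors[v]]                     # features of N(v), v included
    pos = np.mean([agg(X_nbr, w) for w in W])   # branch with b_k = +1
    neg = np.mean([agg(X_nbr, u) for u in U])   # branch with b_k = -1
    return pos - neg
```

A node whose neighborhood contains p_+ and whose W-branch neurons align with p_+ gets a positive output, matching the sign-based classification rule of the paper.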

C.3 ADDITIONAL NOTATIONS FOR THE PROOF

Pattern function M(v). Recall that at iteration t, the aggregator function for node v is written as AGG(X_{N^(t)(v)}, w_k^(t)) = max_{n∈N^(t)(v)} φ(⟨w_k^(t), x_n⟩). Then, at iteration t, we define the pattern function M^(t): V → P ∪ {0} as

M^(t)(v; w) = 0 if max_{n∈N^(t)(v)} φ(⟨w, x_n⟩) ≤ 0, and M^(t)(v; w) = argmax_{x_n, n∈N^(t)(v)} φ(⟨w, x_n⟩) otherwise. (22)

Similar to the definition of p_v in (14), we define M_p^(t) and M_z^(t) such that M_p^(t) is the noiseless pattern with respect to M^(t), while M_z^(t) is the noise with respect to M^(t). In addition, we define M: V → P ∪ {0} for the case without edge sampling such that

M(v; w) = 0 if max_{n∈N(v)} φ(⟨w, x_n⟩) ≤ 0, and M(v; w) = argmax_{x_n, n∈N(v)} φ(⟨w, x_n⟩) otherwise. (23)

Definition of lucky neuron. We call a neuron a lucky neuron at iteration t if and only if its weight vector in {w_k^(t)}_{k=1}^K satisfies

M_p(v; w_k^(t)) ≡ p_+ for any v ∈ V, (24)

or its weight vector in {u_k^(t)}_{k=1}^K satisfies

M_p(v; u_k^(t)) ≡ p_− for any v ∈ V. (25)

Let W(t), U(t) be the sets of lucky neurons at the t-th iteration, such that

W(t) = {k | M_p^(t)(v; w_k^(t)) = p_+ for any v ∈ V}, U(t) = {k | M_p^(t)(v; u_k^(t)) = p_− for any v ∈ V}. (26)

All the other neurons, denoted as W^c(t) and U^c(t), are the unlucky neurons at iteration t. Compared with the definition of "lucky neuron" in Proposition 1 in Section 3.4, we have ∩_{t=1}^T W(t) = K_+. From the context below (see Lemma 6), one can verify that K_+ = ∩_{t=1}^T W(t) = W(0).

Gradient of the lucky neuron and unlucky neuron. We can rewrite the gradient descent in (4) as

∂f_D^(t)/∂w_k^(t) = −(1/|D|) Σ_{v∈D} y_v · ∂g(W^(t), U^(t); X_{N^(t)(v)})/∂w_k^(t) = −(1/(2|D_+|)) Σ_{v∈D_+} ∂g(W^(t), U^(t); X_{N^(t)(v)})/∂w_k^(t) + (1/(2|D_−|)) Σ_{v∈D_−} ∂g(W^(t), U^(t); X_{N^(t)(v)})/∂w_k^(t),

where D_+ and D_− stand for the sets of nodes in D with positive and negative labels, respectively.
According to the definition of M^(t) in (22), it is easy to verify that ∂g(W^(t), U^(t); X_{N(v)})/∂w_k^(t) = M^(t)(v; w_k^(t)), and the update of w_k is

w_k^(t+1) = w_k^(t) + c_η · E_{v∈D} y_v · M^(t)(v; w_k^(t)), or equivalently w_k^(t+1) = w_k^(t) + c_η · E_{v∈D_+} M^(t)(v; w_k^(t)) − c_η · E_{v∈D_−} M^(t)(v; w_k^(t)),

where we abuse the notation E_{v∈S} to denote E_{v∈S} f(v) = (1/|S|) Σ_{v∈S} f(v) for any set S and some function f. Additionally, without loss of generality, neurons that satisfy max_{p∈P} ⟨w_k, p⟩ < 0 are not considered because (1) such a neuron is not updated at all, and (2) the probability of such a neuron is negligible, at 2^{−L}. Finally, as the focus of this paper is order-wise analysis, some constants may be ignored in parts of the proofs. In particular, we use h_1(L) ≳ h_2(L) to denote that there exists some positive constant C such that h_1(L) ≥ C · h_2(L) when L ∈ R is sufficiently large. Similar definitions hold for h_1(L) ≂ h_2(L) and h_1(L) ≲ h_2(L).
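The pattern function in (22)/(23) picks out the neighbor feature that attains the aggregator's maximum. A minimal sketch under our own naming (`pattern_function`), where `X_nbr` stacks the features of N(v) and `patterns_nbr` holds the corresponding noiseless patterns:

```python
import numpy as np

def pattern_function(w, X_nbr, patterns_nbr):
    """M(v; w) as in (22)/(23): return the (noiseless) pattern of the
    neighbor attaining the aggregator's maximum, or None when no
    activation is positive (the '0' case of the definition)."""
    acts = np.maximum(X_nbr @ w, 0.0)          # phi(<w, x_n>) for n in N(v)
    if acts.max() <= 0.0:
        return None
    return patterns_nbr[int(np.argmax(acts))]
```

A lucky neuron, in this notation, is one for which the returned pattern is p_+ (respectively p_−) for every node v.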

D USEFUL LEMMAS

Lemma 1 indicates the relation between the number of neurons and the fraction of lucky neurons. When the number of neurons in the hidden layer is sufficiently large as in (31), the fraction of lucky neurons is at least (1 − ε_K − Lσ/π)/L by (32), where σ is the noise level and L is the number of patterns.

Lemma 1. Suppose the initializations {w_k^(0)}_{k=1}^K and {u_k^(0)}_{k=1}^K are generated through i.i.d. Gaussians. Then, if the number of neurons K is large enough such that

K ≥ ε_K^{−2} L² log q, (31)

the fraction ρ of lucky neurons, which are defined in (24) and (25), satisfies

ρ ≥ (1 − ε_K − Lσ/π) · (1/L). (32)

Table 2: Important notations of sets

[Z], Z ∈ N_+ : the set {1, 2, 3, ..., Z}
V : the set of nodes in graph G
E : the set of edges in graph G
P : the set of class-relevant and class-irrelevant patterns
P_N : the set of class-irrelevant patterns
K_+ : the set of lucky neurons with respect to W^(0)
K_− : the set of lucky neurons with respect to U^(0)
K_{β+}(t) : the set of unpruned neurons with respect to W^(t)
K_{β−}(t) : the set of unpruned neurons with respect to U^(t)
D : the set of training data
D_+ : the set of training data with positive labels
D_− : the set of training data with negative labels
N(v), v ∈ V : the neighbor nodes of node v (including v itself) in graph G
N_s^(t)(v), v ∈ V : the sampled neighbor nodes of node v at iteration t
W(t) : the set of lucky neurons with respect to weights W^(t) at iteration t
U(t) : the set of lucky neurons with respect to weights U^(t) at iteration t
W^c(t) : the set of unlucky neurons with respect to weights W^(t) at iteration t
U^c(t) : the set of unlucky neurons with respect to weights U^(t) at iteration t

Lemmas 2 and 3 characterize the projections of the weights of a lucky neuron in the directions of the class-relevant and class-irrelevant patterns.

Lemma 2. For a lucky neuron k ∈ W(t), let w_k^(t+1) be the next iterate returned by Algorithm 1. Then the neuron weights satisfy the following inequalities.
1. In the direction of p_+, we have
⟨w_k^(t+1), p_+⟩ ≥ ⟨w_k^(t), p_+⟩ + c_η · (α − σ · √((1 + r²) log q / |D|));
2. In the direction of p_− or a class-irrelevant pattern, i.e., for any p ∈ P \ {p_+}, we have
⟨w_k^(t+1), p⟩ − ⟨w_k^(t), p⟩ ≥ −c_η · σ − c_η · σ · √((1 + r²) log q / |D|),
⟨w_k^(t+1), p⟩ − ⟨w_k^(t), p⟩ ≤ c_η · (1 + σ) · √((1 + r²) log q / |D|).

Lemma 3. For a lucky neuron k ∈ U(t), let u_k^(t+1) be the next iterate returned by Algorithm 1. Then the neuron weights satisfy the following inequalities.
1. In the direction of p_−, we have
⟨u_k^(t+1), p_−⟩ ≥ ⟨u_k^(t), p_−⟩ + c_η · (α − σ · √((1 + r²) log q / |D|));
2. In the direction of p_+ or a class-irrelevant pattern, i.e., for any p ∈ P \ {p_−}, we have
⟨u_k^(t+1), p⟩ − ⟨u_k^(t), p⟩ ≥ −c_η · σ − c_η · σ · √((1 + r²) log q / |D|),
⟨u_k^(t+1), p⟩ − ⟨u_k^(t), p⟩ ≤ c_η · (1 + σ) · √((1 + r²) log q / |D|).

Lemmas 4 and 5 characterize the updates of the weights of an unlucky neuron in the directions of the class-relevant and class-irrelevant patterns.

Lemma 4. For an unlucky neuron k ∈ W^c(t), let w_k^(t+1) be the next iterate returned by Algorithm 1. Then the neuron weights satisfy the following inequalities.
1. In the direction of p_+, we have
⟨w_k^(t+1), p_+⟩ ≥ ⟨w_k^(t), p_+⟩ − c_η · σ · √((1 + r²) log q / |D|);
2. In the direction of p_−, we have
⟨w_k^(t+1), p_−⟩ − ⟨w_k^(t), p_−⟩ ≥ −c_η · σ − c_η · σ · √((1 + r²) log q / |D|),
⟨w_k^(t+1), p_−⟩ − ⟨w_k^(t), p_−⟩ ≤ c_η · σ · √((1 + r²) log q / |D|);
3. In the direction of any class-irrelevant pattern p ∈ P_N, we have
⟨w_k^(t+1), p⟩ − ⟨w_k^(t), p⟩ ≤ c_η · (1 + σ) · √((1 + r²) log q / |D|).

Lemma 5. For an unlucky neuron k ∈ U^c(t), let u_k^(t+1) be the next iterate returned by Algorithm 1. Then the neuron weights satisfy the following inequalities.
1. In the direction of p_−, we have
⟨u_k^(t+1), p_−⟩ ≥ ⟨u_k^(t), p_−⟩ − c_η · σ · √((1 + r²) log q / |D|);
2. In the direction of p_+, we have
⟨u_k^(t+1), p_+⟩ − ⟨u_k^(t), p_+⟩ ≥ −c_η · σ − c_η · σ · √((1 + r²) log q / |D|),
⟨u_k^(t+1), p_+⟩ − ⟨u_k^(t), p_+⟩ ≤ c_η · σ · √((1 + r²) log q / |D|);
3. In the direction of any class-irrelevant pattern p ∈ P_N, we have
⟨u_k^(t+1), p⟩ − ⟨u_k^(t), p⟩ ≤ c_η · (1 + σ) · √((1 + r²) log q / |D|).

Lemma 6 indicates that the lucky neurons at initialization remain lucky neurons throughout the iterations, so the number of lucky neurons is at least the number at initialization.

Lemma 6. Let W(t), U(t) be the sets of lucky neurons at the t-th iteration as defined in (26). Then we have W(t) ⊆ W(t + 1) and U(t) ⊆ U(t + 1) if the number of samples satisfies |D| ≳ α^{−2} · (1 + r²) · log q.

Lemma 7 shows that the magnitudes of some lucky neurons are always larger than those of all the unlucky neurons.

Lemma 7. Let {W^(t′)}_{t′=1}^{T′} be the iterates returned by Algorithm 1 before pruning, and let W^c(t′) be the set of unlucky neurons at the t′-th iteration. Then we have ⟨w_{k₁}^(t′), w_{k₁}^(t′)⟩ < ⟨w_{k₂}^(t′), w_{k₂}^(t′)⟩ for any k₁ ∈ W^c(t′) and k₂ ∈ W(0).

Lemma 8 provides a moment-generating-function bound for partly dependent random variables, which can be used to characterize the Chernoff bound for graph structured data.

Lemma 8 (Lemma 7, Zhang et al. (2020a)). Given a set X = {x_n}_{n=1}^N of N partly dependent but identically distributed random variables, suppose that each x_n is dependent on at most d_X random variables in X (including x_n itself), and that the moment generating function of x_n satisfies E_{x_n} e^{s·x_n} ≤ e^{C·s²} for some constant C that may depend on the distribution of x_n. Then the moment generating function of Σ_{n=1}^N x_n is bounded as

E_X e^{s·Σ_{n=1}^N x_n} ≤ e^{C·d_X·N·s²}.
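Lemma 1's claim that roughly a 1/L fraction of randomly initialized neurons is lucky can be sanity-checked by Monte Carlo in the noiseless case (σ = 0), under simplifying assumptions of ours: the L patterns are taken to be standard basis vectors, and a neuron counts as "lucky" for p_+ when its largest pattern projection is on p_+ and is positive.

```python
import numpy as np

# Monte Carlo proxy for the lucky-neuron fraction of Lemma 1 (sigma = 0).
rng = np.random.default_rng(1)
L, K, d = 8, 200_000, 8
W = rng.normal(size=(K, d))          # i.i.d. Gaussian initialization
proj = W[:, :L]                      # <w_k, p_ell>, with patterns e_1..e_L
lucky = (np.argmax(proj, axis=1) == 0) & (proj[:, 0] > 0)
rho = lucky.mean()
print(f"estimated lucky fraction: {rho:.4f}, 1/L = {1/L:.4f}")
```

By symmetry each of the L patterns is the largest projection with probability 1/L, so the estimate concentrates near 1/L, consistent with the (1 − ε_K − Lσ/π)/L bound at σ = 0.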

E PROOF OF MAIN THEOREM

In what follows, we present the formal version of the main theorem (Theorem 2) and its proof.

Theorem 2. Let ε_N ∈ (0, 1) and ε_K ∈ (0, 1 − σL/π) be some positive constants. Suppose the number of training samples satisfies

|D| ≳ ε_N^{−2} · α^{−2} · (1 − β)² · (1 + σ)² · (1 − ε_K − σL/π)^{−2} · (1 + r²) · L² · log q, (36)

and the number of neurons satisfies

K ≳ ε_K^{−2} · L² · log q, (37)

for some positive constant q. Then, after a number of iterations

T ≳ (1 − β) · L / (c_η · α · (1 − ε_N − σ) · (1 − ε_K − σL/π)), (38)

the generalization error function in (3) satisfies

I(g(W^(T), U^(T))) = 0 (39)

with probability at least 1 − q^{−C} for some constant C > 0.

Proof of Theorem 2. Let K_{β+} and K_{β−} be the index sets of the unpruned neurons with respect to M_+ ⊙ W and M_− ⊙ U, respectively. Then, for any node v ∈ V with label y_v = 1, we have

g(M_+ ⊙ W^(t), M_− ⊙ U^(t); v)
= (1/|K_{β+}|) Σ_{k∈K_{β+}} max_{n∈N(v)} φ(⟨w_k^(t), x_n⟩) − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} φ(⟨u_k^(t), x_n⟩)
≥ (1/|K_{β+}|) Σ_{k∈K_{β+}} max_{n∈N(v)} φ(⟨w_k^(t), x_n⟩) − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), p_n⟩| − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), z_n⟩|
= (1/|K_{β+}|) Σ_{k∈W(t)} ⟨w_k^(t), p_+ + z_n⟩ + (1/|K_{β+}|) Σ_{k∈K_{β+}\W(t)} max_{n∈N(v)} φ(⟨w_k^(t), x_n⟩) − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), p_n⟩| − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), z_n⟩|
≥ (1/|K_{β+}|) Σ_{k∈W(t)} ⟨w_k^(t), p_+ + z_n⟩ − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), p_n⟩| − (1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), z_n⟩|, (40)

where W(t) is the set of lucky neurons at iteration t. For the first term, we have

(1/|K_{β+}|) Σ_{k∈W(t)} ⟨w_k^(t), p_+ + z_n⟩ ≥ ((1 − σ)/|K_{β+}|) Σ_{k∈W(0)} ⟨w_k^(t), p_+⟩ ≥ ((1 − σ)/|K_{β+}|) · |W(0)| · c_η · (α − √((1 + r²) log q / |D|)) · t ≳ (1/|K_{β+}|) · |W(0)| · c_η · α · t, (41)

where the first inequality comes from Lemma 6 and the second from Lemma 2. On the one hand, from Lemma 1, we know that at least (1 − ε_K − σL/π) · K of the neurons {w_k^(0)}_{k=1}^K are lucky neurons. On the other hand, by Lemma 7, the magnitudes of the neurons in W(0) before pruning are always larger than those of all the other neurons, so these neurons are not removed by magnitude-based pruning. Therefore,

|W(0)| = (1 − ε_K − σL/π) · K = (1 − ε_K − σL/π) · |K_{β+}| / (1 − β). (42)

Hence,

(1/|K_{β+}|) Σ_{k∈W(t)} ⟨w_k^(t), p_+⟩ ≳ (1 − ε_K − σL/π) · (1/((1 − β) · L)) · c_η · α · t. (43)

In addition, X_{N(v)} does not contain p_− when y_v = 1. Then, from Lemmas 3 and 5, we have

(1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), p_n⟩| ≲ c_η · (1 + σ) · √((1 + r²) log q / |D|) · t ≲ ε_N · (1 − ε_K − σL/π) · α · (1/((1 − β) · L)) · c_η · t, (44)

where the last inequality comes from (36). Moreover,

(1/|K_{β−}|) Σ_{k∈K_{β−}} max_{n∈N(v)} |⟨u_k^(t), z_n⟩| ≤ E_{k∈K_{β−}} ∥u_k^(t)∥₂ · ∥z_n∥₂ ≤ σ · E_{k∈K_{β−}} Σ_{p∈P\{p_−}} ⟨u_k^(t), p⟩ ≲ σ · c_η · L · (1 + σ) · √((1 + r²) log q / |D|) · t. (45)

Combining (43), (44), and (45), we have

g(W^(T), U^(T); v) ≥ (1 − ε_N) · (1 − ε_K − σL/π) · (1 − σL) · (α/((1 − β) · L)) · c_η · T. (46)

Therefore, when the number of iterations satisfies

T > (1 − β) · L / (c_η · α · (1 − ε_N − σ) · (1 − ε_K − σL/π) · (1 − σL)), (47)

we have g(W^(T), U^(T); v) > 1. Similarly to the derivation of (46), for any v ∈ V with label y_v = −1, the roles of the two branches are exchanged: the neighbors of v contain p_− instead of p_+, and the lucky neurons U(T) of the negative branch dominate. Following the same steps,

g(M_+ ⊙ W^(T), M_− ⊙ U^(T); v) ≤ −(1/|K_{β−}|) · |U(T)| · c_η · α · T + c_η · (1 + σ) · √((1 + r²) log q / |D|) · T + c_η · σ · (1 + σ) · √((1 + r²) log q / |D|) · T ≲ −(1 − ε_N) · (1 − ε_K − σL/π) · (1 − σL) · (α/((1 − β) · L)) · c_η · T ≤ −1. (48)

Hence, we have g(W^(T), U^(T); v) < −1 for any v with label y_v = −1. In conclusion, the generalization error function in (3) achieves zero when conditions (36), (37), and (38) hold.

F NUMERICAL EXPERIMENTS F.1 IMPLEMENTATION OF THE EXPERIMENTS

Generation of the synthetic graph structured data. The synthetic data used in Section 4 are generated as follows. First, we randomly generate three groups of nodes, denoted as V_+, V_−, and V_N. The nodes in V_+ are assigned noisy p_+, and the nodes in V_− are assigned noisy p_−. The patterns of the nodes in V_N are class-irrelevant patterns. Any node v in V_N connects to some nodes in either V_+ or V_− uniformly; if the node connects to nodes in V_+, its label is +1, and otherwise its label is −1. Finally, we add random connections among the nodes within V_+, V_−, or V_N. To verify our theorems, each node in V_N connects to exactly one node in V_+ or V_−, and the degree of the nodes in V_N is exactly M, obtained by randomly selecting M − 1 other nodes in V_N. We use full-batch gradient descent for the synthetic data. Implementation of importance sampling on edges. We are given the sampling rate α of important edges and the number of sampled neighbor nodes r. First, we sample the important edge of each node with rate α. Then, we randomly sample the remaining edges without replacement until the number of sampled nodes reaches r. Implementation of magnitude pruning on model weights. The pruning algorithm follows exactly the pseudo-code in Algorithm 1. The number of iterations T′ is selected as 5. Illustration of the error bars. In the figures, the regions in low transparency indicate the error bars. The upper and lower envelopes of the error bars are the mean plus and minus one standard deviation, respectively.
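The two-step importance sampling described above (keep the class-relevant edge with probability α, then fill up to r neighbors uniformly without replacement) can be sketched as follows; the function and argument names are ours, and this is a minimal illustration rather than the exact experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(nbrs, important, alpha, r):
    """Importance sampling on edges: keep the class-relevant ("important")
    neighbor with probability alpha, then fill the remaining slots up to r
    neighbors uniformly without replacement."""
    chosen = []
    if important in nbrs and rng.random() < alpha:
        chosen.append(important)
    rest = [n for n in nbrs if n not in chosen]
    extra = rng.permutation(len(rest))[: max(0, r - len(chosen))]
    chosen += [rest[i] for i in extra]
    return chosen
```

With α = 1 the important neighbor is always retained, which corresponds to the regime where no class-relevant feature is missed and the degree is reduced from |N(v)| to r.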

F.2 EMPIRICAL JUSTIFICATION OF DATA MODEL ASSUMPTIONS

In this part, we use the Cora dataset as an example to demonstrate that our data assumptions can model some real application scenarios. Cora is a citation network containing 2708 nodes, and the nodes belong to seven classes, namely, "Neural Networks", "Rule Learning", "Reinforcement Learning", "Probabilistic Methods", "Theory", "Genetic Algorithms", and "Case Based". For the convenience of presentation, we denote these labels as "class 1" to "class 7", respectively. For each node, we calculate the aggregated feature vector by aggregating its 1-hop neighbor nodes; each feature vector has dimension 1433. Then, we construct matrices by collecting the aggregated feature vectors of nodes with the same label, and the largest singular values of the matrix for each class can be found in Table 4. Next, the pairwise angle, i.e., arccos(⟨z₁, z₂⟩/(∥z₁∥₂ · ∥z₂∥₂)), among the first principal components (the right singular vector with the largest singular value) of different classes is provided, and the results are summarized in Table 5. From Table 4, we can see that the collected feature matrices are all approximately rank-one. In addition, from Table 5, we can see that the first principal components of different classes are almost orthogonal. For instance, the pairwise angles between the principal components of "class 1", "class 2", and "class 3" are 89.3, 88.4, and 89.2 degrees, respectively. Table 4 indicates that the features of the nodes are highly concentrated in one direction, and Table 5 indicates that the principal components of different classes are almost orthogonal and hence independent. Therefore, we can view the primary direction of the feature matrix as the class-relevant feature. If a node frequently connected to the class-relevant features of two classes, the feature matrix would have at least two primary directions, leading to a matrix with rank two or higher.
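The estimation procedure above (top right singular vector of the per-class aggregated feature matrix, then pairwise angles) can be sketched with numpy; the function names are ours and the input `F` is assumed to stack one class's aggregated feature vectors as rows.

```python
import numpy as np

def class_relevant_direction(F):
    """Estimate the class-relevant feature as the first principal component
    (top right singular vector) of the matrix F whose rows are the
    aggregated feature vectors of one class."""
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    return Vt[0]

def angle_deg(z1, z2):
    """arccos(<z1, z2> / (||z1|| ||z2||)) in degrees; the absolute value
    removes the sign ambiguity of singular vectors."""
    c = abs(z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))
```

For exactly rank-one class matrices with orthogonal directions, the estimated components are orthogonal (angle 90 degrees), which is the idealized version of the near-90-degree angles reported in Table 5.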
If the nodes in one class frequently connected to the class-relevant features of two classes, the first principal component of this class would be a mixture of at least two class-relevant features, and the angles among the principal components of different classes could not be almost orthogonal. Therefore, we conclude that most nodes in different classes connect to different class-relevant features, and the class-relevant features are almost orthogonal to each other. In addition, following the same experimental setup, we implement another numerical experiment on the large-scale dataset Ogbn-Arxiv to further justify our data model. The angles between the estimated class-relevant features of the first 10 classes are summarized in Table 6. As we can see, most of the angles are between 65 and 90 degrees, which suggests a sufficiently large distance between the class-relevant patterns of different classes. Moreover, to justify the existence of the node partitions V_+ (V_−) and V_{N+} (V_{N−}), we include a comparison of the node features and the estimated class-relevant features from the same class. Figure 12 illustrates the angle between each node feature and the estimated class-relevant feature of its class. The nodes with an angle smaller than 40 degrees can be viewed as V_+ (V_−), and the other nodes can be viewed as V_{N+} (V_{N−}), which verifies the existence of class-relevant and class-irrelevant patterns. Similar observations that the node embeddings are distributed in a small space are made in (Pan et al., 2018) via clustering the node features.

F.3 ADDITIONAL EXPERIMENTS ON SYNTHETIC DATA

Figure 13 shows the phase transition of the sample complexity when K changes; the sample complexity is almost a linear function of 1/K. Figure 14 shows that the sample complexity increases as a linear function of σ², where σ is the noise level in the features. In Figure 15, α is fixed at 0.8 while the number of sampled neighbors r increases; the sample complexity is linear in r². Figure 16 illustrates the required number of iterations for different numbers of sampled edges, with R = 30. The fitted curve, denoted as a black dashed line, is a linear function of α^{−1} for α = r/R. The fitted curve matches the empirical results for α = r/R, which verifies the bound in (7). Also, with importance sampling, the number of iterations is significantly reduced for large α. Figure 17 illustrates the required number of iterations for convergence with different pruning rates β. All the results are averaged over 100 independent trials. The black dashed line stands for the baseline, i.e., the average number of iterations for training the original dense networks. The blue line with circle markers is the performance of magnitude pruning of neuron weights; the number of iterations is almost a linear function of the pruning rate, which verifies our theoretical findings in (7). The red line with star markers is the performance of random pruning, which requires more iterations than the baseline. Further, we justify our theoretical characterization on the Cora data. Compared with synthetic data, α is not the sampling rate of edges but the sampling rate of class-relevant features, which is unknown for real datasets. Also, there is no standard method to define the success of an experiment when training on practical data. Therefore, we make some modifications in the experiments to fit our theoretical framework.
We utilize the estimated class-relevant features from Appendix F.2 to determine whether a node is class-relevant or not, i.e., we call a node feature class-relevant if the angle between the node feature and the estimated class-relevant feature is smaller than 30 degrees. For synthetic data, we define a trial as a success if it achieves zero generalization error; for practical data, we instead call a trial a success if the test accuracy is larger than 80%. Figure 19 illustrates the sample complexity against α^{−2}, and the phase-transition curve is almost a line, which justifies our theoretical characterization.

F.4 ADDITIONAL EXPERIMENTS ON RELAXED DATA ASSUMPTIONS

In the following numerical experiments, we relax assumption (A1) by adding extra edges between (1) V_+ and V_{N−}, (2) V_− and V_{N+}, and (3) V_+ and V_−. We randomly select a γ fraction of the nodes; each selected node connects to a node in V_− (or V_+) if its label is positive (or negative). We call the selected nodes "outlier nodes" and the other nodes "clean nodes". Note that two identical nodes from the set of outlier nodes can have different labels, which implies that one cannot find any mapping from the outlier nodes to their labels. Therefore, we evaluate the generalization only on the clean nodes but train the GNN on the mixture of clean and outlier nodes. Figure 23 illustrates the phase transition of the sample complexity when γ changes, with the number of sampled edges r = 20, pruning rate β = 0.2, data dimension d = 50, and number of patterns L = 200. The sample complexity remains almost the same when γ is smaller than 0.3, which indicates that our theoretical insights still hold under the relaxed assumption (A1). Figure 26 shows that both edge sampling using UGS and magnitude-based neuron pruning reduce the test error on the Cora, Citeseer, and Pubmed datasets, which justifies our theoretical finding that joint sparsification improves generalization. In comparison, random pruning degrades the performance, with larger test errors than the baseline.
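The outlier-edge construction used to relax (A1) can be sketched as follows; this is a minimal illustration under our own naming (`add_outlier_edges`), where `edges` is an edge list and `V_plus`/`V_minus` hold the class-relevant nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_outlier_edges(edges, labels, V_plus, V_minus, gamma):
    """Relaxed assumption (A1): a gamma fraction of nodes ("outlier nodes")
    each gain an extra edge to a class-relevant node of the opposite class."""
    n = len(labels)
    outliers = rng.choice(n, size=int(gamma * n), replace=False)
    for v in outliers:
        pool = V_minus if labels[v] == +1 else V_plus
        edges.append((int(v), int(rng.choice(pool))))
    return edges, {int(v) for v in outliers}
```

Training would then use all edges while generalization is evaluated only on the nodes outside the returned outlier set, as described above.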


The Ogbn-Proteins dataset is an undirected graph with 132,534 nodes and 39,561,252 edges. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins. All edges come with 8-dimensional features, where each dimension represents the approximate confidence of a single association type and takes values between 0 and 1. The Ogbn-Arxiv dataset is a citation network with 169,343 nodes and 1,166,243 edges. Each node is an arXiv paper, and each edge indicates that one paper cites another. Each node comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. The test errors of training ResGCN on these datasets using joint edge-model sparsification are summarized in Figure 27. Joint sparsification improves the generalization error, which justifies our theoretical insight that joint sparsification reduces the sample complexity.

G THE VC DIMENSION OF THE GRAPH STRUCTURED DATA AND TWO-LAYER GNN

Following the framework in Scarselli et al. (2018), we define the input data as (G_v, v), where G_v is the sub-graph with respect to node v. The input data distribution considered in this paper is denoted as G × V. To simplify the analysis, for any (G_v, v), N(v) includes exactly one feature from {p_+, p_-}, and all the other node features come from P_N. Specifically, we summarize the VC dimension of the GNN model as follows.

Definition 1 (Section 4.1, Scarselli et al. (2018)). Let g be the GNN model and let the data set D be drawn from G × V. g is said to shatter D if for any set of binary assignments y ∈ {0,1}^{|D|}, there exist parameters Θ such that g(Θ; (G_v, v)) > 0 if y_v = 1 and g(Θ; (G_v, v)) < 0 if y_v = 0. Then, the VC-dimension of the GNN is defined as the size of the largest set that can be shattered, namely, VC-dim(g) = max_{D is shattered by g} |D|.

Based on the definition above, we derive the VC-dimension of the GNN model over the data distribution considered in this paper, which is summarized in Theorem 3. Theorem 3 shows that the VC-dimension of the GNN model over this data distribution is at least 2^{L/2-1}, which is an exponential function of L. Inspired by Brutzkus & Globerson (2021) for CNNs, the major idea is to construct a special dataset that the neural network can shatter. Here, we extend the proof by constructing a set of graph structured data as in (51). Then, for any combination of labels, from (54), one can construct a solvable linear system that fits such labels.

Theorem 3 (VC-dim). Suppose the number of orthogonal features is L; then the VC dimension of the GNN model over the data G × V satisfies VC-dim(H_GNN(G × V)) ≥ 2^{L/2-1}.

Proof. Without loss of generality, we denote the class-irrelevant features as {p_i}_{i=1}^{L-2}, so that P = {p_i}_{i=1}^{L-2} ∪ {p_+} ∪ {p_-}. Let us define a set J such that J = {J ∈ {0,1}^{L/2-1} | J_i = 0 or 1, 1 ≤ i ≤ L/2-1}. Then, we define a subset D ⊆ G × V of size 2^{L/2-1} based on the set J.
For any (G_v, v), we consider graph structured data G_v in which v connects to exactly L/2 - 1 neighbors. Recall that for each node v, the feature matrix X_{N(v)} must contain either p_+ or p_-; we randomly pick one node u ∈ N(v) and let x_u be p_+ or p_-. Next, we sort all the other nodes in N(v) and label them as {v_i}_{i=1}^{L/2-1}. Then, given an element J ∈ J, we let the feature of node v_i be x_{v_i} = J_i p_{2i-1} + (1 - J_i) p_{2i}, where J_i is the i-th entry of J. We denote such constructed data (G_v, v) as D_J. Therefore, the data set is defined as D = {D_J | J ∈ J}. (51)

Next, we consider the GNN model in the form of (20) with K = 2^{L/2-1}. Given the label y_J for each data D_J ∈ D, we obtain a group of α_J ∈ R by solving the following equations
Σ_{J' ∈ J \ {J†}} α_{J'} = y_J, for all J ∈ J, (53)
where J† = 1 - J. With a proper ordering of the index set J, (53) can be re-written as
[ 0 1 ⋯ 1 ]   [ α_1 ]   [ y_{1†} ]
[ 1 0 ⋯ 1 ] · [ α_2 ] = [ y_{2†} ]
[ ⋮ ⋮ ⋱ ⋮ ]   [  ⋮  ]   [   ⋮   ]
[ 1 1 ⋯ 0 ]   [ α_{2^{L/2-1}} ]   [ y_{(2^{L/2-1})†} ], (54)
where the coefficient matrix is the all-ones matrix minus the identity. It is easy to verify that the matrix in (54) is invertible. Therefore, we have a unique solution of {α_J}_{J∈J} given any fixed {y_J}_{J∈J}. As K is selected as 2^{L/2-1}, we denote the neuron weights as {w_J}_{J∈J} and {u_J}_{J∈J} by defining
w_J = max{α_J, 0} · Σ_{v'∈N(v), v∈D_J} x_{v'},   u_J = max{-α_J, 0} · Σ_{v'∈N(v), v∈D_J} x_{v'}. (55)
Then, for each D_J ∈ D, we substitute (55) into (20):
g = Σ_{J'∈J} [ max_{v'∈N(v)} σ(⟨w_{J'}, x_{v'}⟩) - max_{v'∈N(v)} σ(⟨u_{J'}, x_{v'}⟩) ]. (56)
When α_{J'} ≥ 0, we have u_{J'} = 0, and
max_{v'∈N(v)} σ(⟨w_{J'}, x_{v'}⟩) = α_{J'} · max_i ⟨ J_i p_{2i-1} + (1-J_i) p_{2i}, J'_i p_{2i-1} + (1-J'_i) p_{2i} ⟩ = 0 if J' = 1 - J, and α_{J'} otherwise, (57)

Lemma 10. Suppose D_I = V_+ ∪ V_- and D_U = V_N. We have
E α ≥ γr / (R - r + γ).
Corollary 10.1. For the importance sampling strategy in Lemma 10, E α ≥ γ · r/R when R ≫ r.
Remark 10.1: From Corollary 10.1, we can see that the importance sampling in Lemma 10 can save the sample complexity by a factor of 1/γ² when r ≪ R on average.

Lemma 11 describes the importance sampling strategy when a fraction of the nodes with class-irrelevant features are also assigned a high sampling probability. From Corollary 11.1, we know that α can be improved over uniform sampling by a factor of γ/(1+(γ-1)λ), where λ is the fraction |V_N ∩ D_I|/|V_N|. Formally, we have

Lemma 11. Suppose D_I = V_+ ∪ V_- ∪ V_λ, where V_λ ⊆ V_N with |V_λ| = λ|V_N|. We have
E α ≥ γr / ( (1+(γ-1)λ)(R - r) + γ ).
Corollary 11.1. For the importance sampling strategy in Lemma 11, E α ≥ γ/(1+(γ-1)λ) · r/R when R ≫ r.
Remark 11.1: From Corollary 11.1, we can see that the importance sampling in Lemma 11 can save the sample complexity by a factor of (1+(γ-1)λ)/γ² when r ≪ R on average.
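The two lower bounds can be compared numerically. The formulas below follow our reconstruction of the garbled displays (so the helper functions are illustrative only); note that Lemma 11 reduces to Lemma 10 when λ = 0, and that both approach γ/(1+(γ-1)λ) · r/R as R grows:

```python
def alpha_lemma10(gamma, r, R):
    # reconstructed bound: E[alpha] >= gamma * r / (R - r + gamma)
    return gamma * r / (R - r + gamma)

def alpha_lemma11(gamma, lam, r, R):
    # reconstructed bound: E[alpha] >= gamma * r / ((1 + (gamma-1)*lam)*(R - r) + gamma)
    return gamma * r / ((1 + (gamma - 1) * lam) * (R - r) + gamma)
```

For γ > 1, any λ > 0 strictly decreases the bound, matching the intuition that spreading sampling probability over class-irrelevant nodes wastes part of the budget.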

H PROOF OF USEFUL LEMMAS

H.1 PROOF OF LEMMA 1

The major idea is to obtain the probability of being a lucky neuron as shown in (66). Compared with the noiseless case, which can be solved by applying the symmetry properties in Brutzkus & Globerson (2021), this paper takes extra steps to characterize the boundary shift caused by the noise in the feature vectors, as shown in (65).

Proof of Lemma 1. For randomly generated weights w_k drawn from a Gaussian distribution, let us define the random variable i_k such that i_k = 1 if w_k is a lucky neuron and i_k = 0 otherwise. Next, we derive the distribution of i_k. Let θ_1 be the angle between p_+ and the initial weights w, and let {θ_ℓ}_{ℓ=2}^{L} be the angles between w and the features p_- and p ∈ P_N. It is easy to verify that {θ_ℓ}_{ℓ=1}^{L} follow the uniform distribution on the interval [0, 2π]. Because the features in P are orthogonal to each other, the inner products ⟨w, p⟩ with p ∈ P are independent of each other given that w follows a Gaussian distribution; equivalently, {θ_ℓ}_{ℓ=1}^{L} are independent of each other. When the noise level σ (in the order of 1/L) is significantly smaller than 1, the probability of a lucky neuron can be bounded as
Prob( θ_1 + Δθ ≤ θ_ℓ - Δθ ≤ 2π, 2 ≤ ℓ ≤ L ), (65)
where Δθ ≂ σ. Then, we have
Prob( θ_1 + Δθ ≤ θ_ℓ - Δθ ≤ 2π, 2 ≤ ℓ ≤ L ) = Π_{ℓ=2}^{L} Prob( θ_1 + Δθ ≤ θ_ℓ - Δθ ≤ 2π ) = ( (2π - θ_1 - 2Δθ) / (2π) )^{L-1}.
Next, we can bound the probability as
Prob( w is a lucky neuron ) = ∫_0^{2π} (1/(2π)) · ( (2π - θ_1 - 2Δθ) / (2π) )^{L-1} dθ_1 ≃ (1/L) · ( (2π - 2Δθ) / (2π) )^{L} ≃ (1/L) · ( 1 - Lσ/π ). (66)
Hence, i_k follows a Bernoulli distribution with probability (1/L)(1 - Lσ/π). By Hoeffding's inequality,
(1/L)(1 - Lσ/π) - C√(log q / K) ≤ (1/K) Σ_{k=1}^{K} i_k ≤ (1/L)(1 - Lσ/π) + C√(log q / K) (67)
with probability at least 1 - q^{-C}. Let K = Ω(ε_K^{-2} L² log q); then
(1/K) Σ_{k=1}^{K} i_k ≥ ( 1 - ε_K - Lσ/π ) · (1/L). (68)
To guarantee that the probability in (68) is positive, the noise level needs to satisfy σ < π/L.
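The integral leading to (66) can be carried out explicitly. Under our reconstruction of the garbled display, with the substitution u = (2π - θ₁ - 2Δθ)/(2π):

```latex
\Pr(w \text{ is a lucky neuron})
  = \int_{0}^{2\pi-2\Delta\theta} \frac{1}{2\pi}
      \left(\frac{2\pi-\theta_1-2\Delta\theta}{2\pi}\right)^{L-1} d\theta_1
  = \frac{1}{L}\left(\frac{2\pi-2\Delta\theta}{2\pi}\right)^{L}
  \approx \frac{1}{L}\left(1-\frac{L\,\Delta\theta}{\pi}\right)
  \simeq \frac{1}{L}\left(1-\frac{L\sigma}{\pi}\right),
```

where the last two steps use the first-order expansion (1 - x)^L ≈ 1 - Lx for Lx ≪ 1 and Δθ ≂ σ.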
H.2 PROOF OF LEMMA 2

The bound of the gradient is divided into two parts in (74), where I_1 and I_2 concern the noiseless features and the noise, respectively. The term I_2 can be bounded using a standard Chernoff bound; see (80) for the final result. For I_1, the gradient derived from D_+ is always in the direction of p_+ for a lucky neuron. Therefore, the neuron weights keep increasing in the direction of p_+ by (83). Also, if ⟨w_k, x_v⟩ > 0 for some y_v = -1, the gradient derived from v forces w_k to move in the opposite direction of x_v. Therefore, the gradient derived from D_- only has a negative contribution in updating w_k. For any pattern existing in {p_v}_{v∈D_-}, which is equivalent to saying for any p ∈ P \ {p_+}, the value of ⟨w_k, p⟩ has an upper bound; see (90).

Proof of Lemma 2. For any v ∈ D_+ and a lucky neuron with weights w_k^{(t)}, we have M(v; w_k^{(t)}) = p_+ + M_z(v; w_k^{(t)}). If the neighbor node with the class-relevant pattern is sampled in N^{(t)}(v), the gradient is calculated as
∂g(W^{(t)}, U^{(t)}; X_{N^{(t)}(v)}) / ∂w_k |_{w_k = w_k^{(t)}} = p_+ + M_z(v; w_k^{(t)}).
For v ∈ D_- and a lucky neuron w_k^{(t)}, we have M^{(t)}(v; w_k^{(t)}) ∈ {x_n}_{n∈N(v)} ∪ {0}, and the gradient can be represented as
∂g(W^{(t)}, U^{(t)}; X_{N^{(t)}(v)}) / ∂w_k |_{w_k = w_k^{(t)}} = M^{(t)}(v; w_k^{(t)}) = M_p^{(t)}(v; w_k^{(t)}) + M_z^{(t)}(v; w_k^{(t)}). (73)

Definition of I_1 and I_2. From the analysis above, we have
⟨w_k^{(t+1)}, p⟩ - ⟨w_k^{(t)}, p⟩ = c_η · E_{v∈D} ⟨ y_v · M^{(t)}(v; w_k^{(t)}), p ⟩ = c_η · ( E_{v∈D} ⟨ y_v · M_p^{(t)}(v; w_k^{(t)}), p ⟩ + E_{v∈D} ⟨ y_v · M_z^{(t)}(v; w_k^{(t)}), p ⟩ ) := c_η (I_1 + I_2). (74)

Bound of I_2. Recall that the noise factors z_v are independent and identically distributed over v, so it is easy to verify that y_v and M_z(v; w_k^{(t)}) are independent. Then, I_2 in (74) can be bounded as
|I_2| = | E_{v∈D} y_v · E_{v∈D} ⟨ M_z(v; w_k^{(t)}), p ⟩ | ≤ | E_{v∈D} y_v | · | E_{v∈D} ⟨ M_z(v; w_k^{(t)}), p ⟩ | ≤ σ · | E_{v∈D} y_v |. (75)
Recall that the number of sampled neighbor nodes is fixed at r. Hence, for any fixed node v, there are at most (1 + r²) elements u ∈ D (including v itself) that are dependent with y_v. Also, from assumption (A2), we have Prob(y = 1) = 1/2 and Prob(y = -1) = 1/2, so E y_v = 0. From Lemma 8, the moment generating function of Σ_{v∈D} (y_v - E y_v) satisfies
E e^{ s Σ_{v∈D} (y_v - E y_v) } ≤ e^{ C (1+r²) |D| s² }, (76)
where C is some positive constant. By the Chernoff inequality, we have
Prob( Σ_{v∈D} (y_v - E y_v) > th ) ≤ e^{ C(1+r²)|D|s² } / e^{ th · s } for any s > 0. (77)
Let s = th / ( C(1+r²)|D| ) and th = √( (1+r²)|D| log q ); then
Σ_{v∈D} (y_v - E y_v) ≤ C √( (1+r²)|D| log q ) (78)
with probability at least 1 - q^{-C}. Therefore, I_2 is bounded as
|I_2| ≤ σ | Σ_{v∈D} (1/|D|) y_v | ≤ σ ( | Σ_{v∈D} (1/|D|) (y_v - E y_v) | + |E y_v| ) ≲ σ √( (1+r²) log q / |D| ). (80)

Bound of I_1 when p = p_+. Because the neighbors of a node v with a negative label do not contain p_+, M_p(v; w_k) in (73) cannot be p_+. In addition, p_+ is orthogonal to {p_n}_{n∈N(v)} ∪ {0}. Then, we have
⟨ p_+, ∂g(W, U; X_{N(v)}) / ∂w_k |_{w_k = w_k^{(t)}} ⟩ = ⟨ p_+, M_z(v; w_k^{(t)}) ⟩.
Let D_s^{(t)} denote the set of nodes whose sampled neighbor nodes include the class-relevant pattern. Recall that α, defined in Section 3, is the sampling rate of the nodes with important edges; we have |D_s^{(t)}| = α |D_+|. (82)
Then,
I_1 = ⟨ E_{v∈D_+} M_p(v; w_k^{(t)}) - E_{v∈D_-} M_p(v; w_k^{(t)}), p_+ ⟩ = ⟨ E_{v∈D_+} M_p(v; w_k^{(t)}), p_+ ⟩ ≥ ⟨ E_{v∈D_s} M_p(v; w_k^{(t)}), p_+ ⟩ ≥ α. (83)

Bound of I_1 when p ≠ p_+. Next, for any p ∈ P \ {p_+}, we prove that
I_1 ≤ √( (1+r²) log q / |D| ).
When ⟨ w_k^{(t)}, p ⟩ ≤ -σ, we have M_p^{(t)}(v; w_k^{(t)}) ≠ p, and ⟨ M_p^{(t)}(v; w_k^{(t)}), p ⟩ = 0 for any v ∈ D. Therefore, we have
I_1 = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩ = 0.
When ⟨ w_k^{(t)}, p ⟩ > -σ, we have
I_1 = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩ = ⟨ E_{v∈D_+/D_s} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩.
Let us define a mapping H: R^d → R^d such that
H(p_v) = p_- if p_v = p_+, and H(p_v) = p_v otherwise.
Then, we have
I_1 = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩
 = ⟨ E_{v∈D_+/D_s} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩
 = ⟨ E_{v∈D_+/D_s} M_p^{(t)}(H(v); w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩
 ≤ ⟨ E_{v∈D_+} M_p^{(t)}(H(v); w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩
 = ⟨ E_{v∈D_+} M_p^{(t)}(H(v); w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(H(v); w_k^{(t)}), p ⟩,
where the third equality holds because N^{(t)}(v) does not contain p_+ for v ∈ D_+/D_s, so H(X_{N^{(t)}(v)}) = X_{N^{(t)}(v)}, and the last equality holds because x_v does not contain p_+ for any node v ∈ D_-. Therefore, we have
⟨ E_{v∈D_+} M_p^{(t)}(H(v); w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(H(v); w_k^{(t)}), p ⟩ = E_{v∈D} ⟨ y_v M_p^{(t)}(H(v); w_k^{(t)}), p ⟩ ≤ | E_{v∈D} y_v | · | E_{v∈D} ⟨ M_p^{(t)}(H(v); w_k^{(t)}), p ⟩ | ≤ √( (1+r²) log q / |D| ) (90)
with probability at least 1 - q^{-C} for some positive constant C.

Proof of statement 1. From (80) and (83), the update of w_k in the direction of p_+ is lower bounded as
⟨ w_k^{(t+1)}, p_+ ⟩ - ⟨ w_k^{(t)}, p_+ ⟩ ≥ c_η (I_1 - |I_2|) ≥ c_η ( α - σ √( (1+r²) log q / |D| ) ).
Then, we have
⟨ w_k^{(t)}, p_+ ⟩ - ⟨ w_k^{(0)}, p_+ ⟩ ≥ c_η ( α - σ √( (1+r²) log q / |D| ) ) · t.

Proof of statement 2. From (80) and (90), the update of w_k in the direction of p ∈ P \ {p_+} is upper bounded as
⟨ w_k^{(t+1)}, p ⟩ - ⟨ w_k^{(t)}, p ⟩ ≤ c_η (I_1 + |I_2|) ≤ c_η (1 + σ) √( (1+r²) log q / |D| ).
Therefore, we have
⟨ w_k^{(t)}, p ⟩ - ⟨ w_k^{(0)}, p ⟩ ≤ c_η (1 + σ) √( (1+r²) log q / |D| ) · t.
The update of w_k in the direction of p ∈ P \ {p_+} is lower bounded as
⟨ w_k^{(t+1)}, p ⟩ - ⟨ w_k^{(t)}, p ⟩ ≥ c_η (I_1 - |I_2|) ≥ -c_η (1 + σ) √( (1+r²) log q / |D| ).
To derive the lower bound, we prove the following inequality via mathematical induction:
⟨ w_k^{(t)}, p ⟩ ≥ -c_η ( 1 + σ + σt ) √( (1+r²) log q / |D| ). (96)
It is clear that (96) holds when t = 0. Suppose (96) holds for t. When ⟨ w_k^{(t)}, p ⟩ ≤ -σ, we have M_p^{(t)}(v; w_k^{(t)}) ≠ p and ⟨ M_p^{(t)}(v; w_k^{(t)}), p ⟩ = 0 for any v ∈ D, so I_1 = 0. Then,
⟨ w_k^{(t+1)}, p ⟩ = ⟨ w_k^{(t)}, p ⟩ + c_η (I_1 + I_2) = ⟨ w_k^{(t)}, p ⟩ + c_η I_2
 ≥ -c_η ( 1 + σ + σt ) √( (1+r²) log q / |D| ) - c_η σ √( (1+r²) log q / |D| )
 = -c_η ( 1 + σ + σ(t+1) ) √( (1+r²) log q / |D| ). (99)
When ⟨ w_k^{(t)}, p ⟩ > -σ, we have
⟨ w_k^{(t+1)}, p ⟩ = ⟨ w_k^{(t)}, p ⟩ + c_η (I_1 + I_2) = ⟨ w_k^{(t)}, p ⟩ + c_η ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩ + c_η I_2
 = ⟨ w_k^{(t)}, p ⟩ - c_η ⟨ E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p ⟩ + c_η I_2 ≥ -σ - c_η - c_η σ √( (1+r²) log q / |D| ). (100)
Therefore, from (99) and (100), we know that (96) holds for t + 1.

H.3 PROOF OF LEMMA 4

The bound of the gradient is divided into two parts in (101), where I_3 and I_4 correspond to the noiseless features and the noise, respectively. The bound for I_4 is similar to that for I_2 in the proof of Lemma 2. However, for an unlucky neuron, the gradient derived from D_+ is not always in the direction of p_+, and we need extra techniques to characterize the offsets between D_+ and D_-. One critical issue is to guarantee the independence of M^{(t)}(v) and y_v, which is solved by constructing matched data; see (108) for the definition. We show that the magnitude of w_k^{(t)} scales in the order of √( (1+r²)/|D| ) by (113).

Proof of Lemma 4. Definition of I_3 and I_4. Similar to (74), we define the terms I_3 and I_4 as
⟨ w_k^{(t+1)}, p ⟩ - ⟨ w_k^{(t)}, p ⟩ = c_η ( E_{v∈D} ⟨ y_v · M_p^{(t)}(v; w_k^{(t)}), p ⟩ + E_{v∈D} ⟨ y_v · M_z^{(t)}(v; w_k^{(t)}), p ⟩ ) := c_η (I_3 + I_4). (101)

Bound of I_4. Following a derivation similar to (80), we obtain
|I_4| ≲ σ √( (1+r²) log q / |D| ). (102)

Bound of I_3 when p = p_+. From (101), we have
I_3 = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p_+ ⟩ = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}), p_+ ⟩ ≥ 0. (103)

Bound of I_3 when p ∈ P_N. To bound I_3 for any fixed p ∈ P_N, we maintain the subsets S_k(t) ⊆ D_- such that S_k(t) = { v ∈ D_- | M_p^{(t)}(v; w_k^{(t)}) = p_- }. First, when S_k(t) = ∅, it is easy to verify that I_3 = 0; therefore, S_k(t+1) = ∅ and S_k(t') = ∅ for all t' ≥ t. (104)
Second, when S_k(t) ≠ ∅, we have
I_3 = - ( |S_k(t)| / |D_-| ) ∥p_-∥². (105)
Note that if ⟨ w_k^{(t)}, p_- ⟩ < -σ, then S_k(t) must be an empty set. Therefore, after at most t_0 = ( ∥w_k^{(0)}∥_2 + σ ) / c_η iterations, S_k(t_0) = ∅, and S_k(t) = ∅ for all t ≥ t_0 by (104). Next, for some large t_0 and any v ∈ V, we define a mapping F: V → R^{rd} such that
F(v) = [ o_{v_1}^⊤ o_{v_2}^⊤ ⋯ o_{v_r}^⊤ ]^⊤, where {v_1, ⋯, v_r} = N^{(t)}(v), o_{v_i} = x_{v_i} if x_{v_i} ∉ {p_+, p_-}, and o_{v_i} = 0 if x_{v_i} ∈ {p_+, p_-}. (108)
When y_v = 1, we have ⟨ M_p^{(t)}(v), p ⟩ = ⟨ M_p^{(t)}(F(v)), p ⟩ if M_p^{(t)}(F(v)) ≠ p_+, and 0 ≤ ⟨ M_p^{(t)}(F(v)), p ⟩ if M_p^{(t)}(F(v)) = p_+. (109)
When y_v = -1 and t ≥ t_0, recalling that S_k(t) = ∅, we have ⟨ M_p^{(t)}(v), p ⟩ = ⟨ M_p^{(t)}(F(v)), p ⟩. (110)
Combining (109) and (110), we have
I_3 ≤ E_{v∈V} y_v · ⟨ M_p^{(t)}(F(v)), p ⟩. (111)
From assumptions (A2) and (17), we know that the distributions of p_+ and p_- are identical, and the distributions of the class-irrelevant patterns p ∈ P_N are independent of y. Therefore, it is easy to verify that y_v and F(v) are independent of each other; then
I_3 ≤ E_{v∈D} y_v · E_v ⟨ M(F(v)), p ⟩. (112)
From (112) and (78), we have
I_3 ≤ √( (1+r²) log q / |D| ) (113)
for any p ∈ P_N.

Proof of statement 1. From (102) and (103), we have
⟨ w_k^{(t+1)}, p ⟩ - ⟨ w_k^{(t)}, p ⟩ = c_η (I_3 + I_4) ≥ -c_η |I_4| ≥ -c_η σ √( (1+r²) log q / |D| ).

Proof of statement 2. When p = p_-, from the definition of I_3 in (101), we know
I_3 = ⟨ E_{v∈D_+} M_p^{(t)}(v; w_k^{(t)}) - E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p_- ⟩ = -⟨ E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p_- ⟩ ≤ 0. (116)
The update of w_k in the direction of p_- is upper bounded as
⟨ w_k^{(t+1)}, p_- ⟩ - ⟨ w_k^{(t)}, p_- ⟩ ≤ c_η (I_3 + I_4) ≤ c_η I_4 ≤ c_η σ √( (1+r²) log q / |D| ). (117)
Therefore, we have
⟨ w_k^{(t)}, p_- ⟩ - ⟨ w_k^{(0)}, p_- ⟩ ≤ c_η σ √( (1+r²) log q / |D| ) · t. (118)
To derive the lower bound, we prove the following inequality via mathematical induction:
⟨ w_k^{(t)}, p_- ⟩ ≥ -c_η ( 1 + σ + σt ) √( (1+r²) log q / |D| ). (119)
It is clear that (119) holds when t = 0. Suppose (119) holds for t. When ⟨ w_k^{(t)}, p_- ⟩ ≤ -σ, we have M_p^{(t)}(v; w_k^{(t)}) ≠ p_- and ⟨ M_p^{(t)}(v; w_k^{(t)}), p_- ⟩ = 0 for any v ∈ D, so I_3 = 0 and
⟨ w_k^{(t+1)}, p_- ⟩ = ⟨ w_k^{(t)}, p_- ⟩ + c_η I_4 ≥ -c_η ( 1 + σ + σt ) √( (1+r²) log q / |D| ) - c_η σ √( (1+r²) log q / |D| ) = -c_η ( 1 + σ + σ(t+1) ) √( (1+r²) log q / |D| ). (122)
When ⟨ w_k^{(t)}, p_- ⟩ > -σ, we have
⟨ w_k^{(t+1)}, p_- ⟩ = ⟨ w_k^{(t)}, p_- ⟩ + c_η (I_3 + I_4) = ⟨ w_k^{(t)}, p_- ⟩ - c_η ⟨ E_{v∈D_-} M_p^{(t)}(v; w_k^{(t)}), p_- ⟩ + c_η I_4 ≥ -σ - c_η - c_η σ √( (1+r²) log q / |D| ). (123)
Therefore, from (122) and (123), we know that (119) holds for t + 1.

Proof of statement 3. From (102) and (113), for any p ∈ P_N,
| ⟨ w_k^{(t+1)}, p ⟩ - ⟨ w_k^{(t)}, p ⟩ | = c_η |I_3 + I_4| ≤ c_η (1 + σ) √( (1+r²) log q / |D| ),
and therefore | ⟨ w_k^{(t)}, p ⟩ - ⟨ w_k^{(0)}, p ⟩ | ≤ c_η (1 + σ) √( (1+r²) log q / |D| ) · t.

H.4 PROOF OF LEMMA 6

The major idea is to show that the weights of the lucky neurons keep increasing in the direction of the class-relevant patterns. Therefore, the lucky neurons consistently select the class-relevant patterns in the aggregation function.

Proof of Lemma 6. For any k ∈ W(0), we have ⟨ w_k^{(t)}, p_+ ⟩ ≥ (1+σ)/(1-σ) · ⟨ w_k^{(t)}, p ⟩ for p ∈ P \ {p_+} by the definition of a lucky neuron in (24). Next, from Lemma 2, we have
⟨ w_k^{(t)}, p_+ ⟩ ≥ c_η ( α - σ √( (1+r²) log q / |D| ) ) · t, (126)
and
⟨ w_k^{(t)}, p ⟩ ≤ c_η (1 + σ) √( (1+r²) log q / |D| ) · t. (127)
Combining (126) and (127), we have
⟨ w_k^{(t)}, p_+ ⟩ - (1+σ)/(1-σ) · ⟨ w_k^{(t)}, p ⟩ ≳ c_η ( α - (1 + 4σ) √( (1+r²) log q / |D| ) ) · t.
When |D| ≳ α^{-2} (1+r²) log q, we have ⟨ w_k^{(t)}, p_+ ⟩ ≥ (1+σ)/(1-σ) · ⟨ w_k^{(t)}, p ⟩. Therefore, we have k ∈ W(t) and W(0) ⊆ W(t).
One can derive the proof for U(t) following similar steps.

H.5 PROOF OF LEMMA 7

The proof is built upon the statements on the lucky and unlucky neurons in Lemmas 2 and 4. The magnitude of a lucky neuron in the direction of p_+ is in the order of α, while the magnitude of an unlucky neuron is at most 1/√|D|. Given a sufficiently large |D|, the magnitude of a lucky neuron is always larger than that of an unlucky neuron.

Proof of Lemma 7. From Lemma 6, we know that (1) if w_{k_1}^{(0)} is a lucky neuron, then w_{k_1}^{(t')} is still a lucky neuron for any t' ≥ 0; and (2) if w_{k_1}^{(t')} is an unlucky neuron, then w_{k_1}^{(t'')} is still an unlucky neuron for any t'' ≤ t'. For any k_1 ∈ W(0), from Lemma 2, we have
⟨ w_{k_1}^{(t')}, w_{k_1}^{(t')} ⟩^{1/2} ≥ ⟨ w_{k_1}^{(t')}, p_+ ⟩ ≥ c_η ( α - σ √( (1+r²) log q / |D| ) ) · t'.
For any k_2 ∈ W^c(t'), from Lemma 4, we have
⟨ w_{k_2}^{(t')}, w_{k_2}^{(t')} ⟩^{1/2} ≤ Σ_{p∈P} | ⟨ w_{k_2}^{(t')}, p ⟩ | ≤ L · c_η (1 + σ) √( (1+r²) log q / |D| ) · t'.
Therefore, ⟨ w_{k_2}^{(t')}, w_{k_2}^{(t')} ⟩^{1/2} < ⟨ w_{k_1}^{(t')}, w_{k_1}^{(t')} ⟩^{1/2} if |D| is greater than α^{-2} (1+r²) L² log q.

H.6 PROOF OF LEMMA 8

Proof of Lemma 8. According to the definitions in Janson (2004), there exists a family {(X_j, w_j)}_j, where X_j ⊆ X and w_j ∈ [0,1], such that Σ_j w_j Σ_{x_{n_j} ∈ X_j} x_{n_j} = Σ_{n=1}^{N} x_n and Σ_j w_j ≤ d_X, by equations (2.1) and (2.2) in Janson (2004). Then, let {p_j} be any positive numbers with Σ_j p_j = 1. By Jensen's inequality, for any s ∈ R, we obtain the chain of inequalities culminating in (136).

Let W ∈ R^{d×K} and B ∈ R^{2×K} be the collections of the w_k's and b_k's, respectively. Then, the output of the graph neural network, denoted g ∈ R², is calculated as
g(W, B; X_{N(v)}) = (1/K) Σ_{k=1}^{K} b_k · AGG(X_{N(v)}, w_k).
Then, the label generated by the graph neural network is written as y_est = sign(g). (138)
Given the set of data D, we divide it into four groups.



The analysis can be extended to multi-class classification; see Appendix I.
The orthogonality constraint simplifies the analysis and has been employed in (Brutzkus & Globerson, 2021). We relax this constraint in the experiments on synthetic data in Section 4.
The lower bounds of α for some sampling strategies are provided in Appendix G.1.
The experiments are implemented using the codes from https://github.com/VITA-Group/Unified-LTH-GNN and https://github.com/williamleif/GraphSAGE.



Figure 1: Illustration of node classification in the GNN

Figure 2: Toy example of the data model. Nodes 1 and 2 have label +1. Nodes 3 and 4 are labeled as -1. Nodes 1 and 4 have class-relevant features. Nodes 2 and 3 have class-irrelevant features. V_+ = {1}, V_{N+} = {2}, V_{N-} = {3}, V_- = {4}.

Figure 3: |D| against the importance sampling probability α

Figure 6: Number of iterations against the pruning rate β

Figure 9: Test error on Cora.

p_v, v ∈ V — The noiseless input feature for node v in R^d
z_v, v ∈ V — The noise factor for node v in R^d
y_v, v ∈ V — The label of node v in {-1, +1}
R — The degree of the original graph G
W, U — The neuron weights in the hidden layer
X_N — The collection of {x_n}_{n∈N} in R^{d×|N|}
r — The size of the sampled neighbor nodes
L — The number of class-relevant and class-irrelevant features
p_+ — The class-relevant pattern with respect to the positive label
p_- — The class-relevant pattern with respect to the negative label
z — The additive noise in the input features
σ — The upper bound of ∥z∥_2 for the noise factor z
M_+ — The mask matrix for W of the pruned model
M_- — The mask matrix for U of the pruned model
Lemma 4. For an unlucky neuron k ∈ W c (t), let w (t+1) k

Algorithm 1 terminates if the training error becomes zero or the maximum of 500 iterations is reached. If not otherwise specified, δ = 0.1, c_η = 1, r = 20, α = r/R, β = 0.2, d = 50, L = 200, σ = 0.2, and |D| = 100, with the rest of the nodes used as test data.

Figure 12: Distribution of the cosine similarity between the node feature and the estimated class-relevant feature from the same class

Figure 18 shows the test errors with different numbers of samples, averaged over 1000 independent trials. The red line with circle marks shows the performance of training the original dense model. The blue line with star marks concerns the model after magnitude pruning, whose test error is reduced compared with training the original model. An additional experiment using random pruning is summarized as the black line with diamond marks, whose test error is consistently larger than that of training the original model.
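Magnitude-based neuron pruning as used in this experiment can be sketched as follows; this is a simplified stand-in for the actual implementation (the function name and list-of-lists weight layout are ours):

```python
def prune_neurons(weights, beta):
    """Magnitude-based neuron pruning: zero out the beta-fraction of
    neurons (rows of `weights`) with the smallest L2 norm, keeping the
    remaining neurons for retraining.  Returns the pruned weights and
    the binary neuron mask."""
    norms = [sum(x * x for x in w) ** 0.5 for w in weights]
    n_prune = int(beta * len(weights))
    # indices of the n_prune smallest-magnitude neurons
    pruned = set(sorted(range(len(weights)), key=lambda i: norms[i])[:n_prune])
    mask = [0.0 if i in pruned else 1.0 for i in range(len(weights))]
    return [[m * x for x in w] for m, w in zip(mask, weights)], mask
```

Random pruning corresponds to drawing `pruned` uniformly at random instead of by norm, which is the black-diamond baseline above.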

Figure 13: |D| against the number of neurons K

Figure 19: |D| against the estimated importance sampling probability α.

Figure 22: Toy example of the data model. Nodes 1 and 2 have label +1. Nodes 3 and 4 are labeled as -1. Nodes 1 and 4 have class-relevant features. Nodes 2 and 3 have class-irrelevant features. V_+ = {1}, V_{N+} = {2}, V_{N-} = {3}, V_- = {4}.

Figure 26: Node classification performance of GCN on Cora (sub-figures in the first column), Citeseer (sub-figures in the second column), Pubmed (sub-figures in the third column).

Figure 27: Test error on (a) Ogbn-Proteins and (b) Ogbn-Arxiv.









e^{ s Σ_{n=1}^{N} x_n } = e^{ Σ_j p_j · (s w_j X_j / p_j) }, where X_j = Σ_{x_{n_j} ∈ X_j} x_{n_j}. Then, we have
E_X e^{ s Σ_{n=1}^{N} x_n } ≤ E_X Σ_j p_j e^{ s w_j X_j / p_j }.
Choosing p_j proportional to w_j |X_j|^{1/2}, we then have
E_X e^{ s Σ_{n=1}^{N} x_n } ≤ e^{ C ( Σ_j w_j |X_j|^{1/2} )² s² }.
By the Cauchy-Schwarz inequality, we have
( Σ_j w_j |X_j|^{1/2} )² ≤ ( Σ_j w_j ) ( Σ_j w_j |X_j| ) ≤ d_X N. (135)
Hence, we have
E_X e^{ s Σ_{n=1}^{N} x_n } ≤ e^{ C d_X N s² }. (136)

Published as a conference paper at ICLR 2023

I EXTENSION TO MULTI-CLASS CLASSIFICATION

Consider a classification problem with four classes; we use the label y ∈ {+1, -1}² to denote the corresponding class. Similar to the setup in Section 2, there are four orthogonal class-relevant patterns, namely p_1, p_2, p_3, p_4. In the first layer (the hidden layer), we have K neurons with weights w_k ∈ R^d. In the second layer (the linear layer), the weights are denoted as b_k ∈ R² for the k-th neuron.

Figure 28: Illustration of graph neural network learning for multi-class classification

Some Important Notations
K — Number of neurons in the hidden layer
σ — Upper bound of the additive noise in the input features
r — Number of sampled edges for each node
R — The maximum degree of the original graph G
L — The number of class-relevant and class-irrelevant patterns
α — The probability that the sampled neighbors of a node contain the class-relevant node
β — Pruning rate of the model weights; β ∈ [0, 1 - 1/L); β = 0 means no pruning

This work was supported by AFOSR FA9550-20-1-0122, ARO W911NF-21-1-0255, NSF 1932196, and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons). We thank Kevin Li and Sissi Jian at Rensselaer Polytechnic Institute for their help in formulating the numerical experiments. We thank all anonymous reviewers for their constructive comments.

REPRODUCIBILITY STATEMENT

For the theoretical results in Section 3.2, we provide the necessary lemmas in Appendix D and a complete proof of the major theorems based on the lemmas in Appendix E. The proofs of all the lemmas are included in Appendix H. For the experiments in Section 4, the implementation details for generating the data and figures are summarized in Appendix F, and the source code can be found in the supplementary material.

In the following, the related works are included in Appendix A. Appendix B provides a high-level idea of the proof techniques. Appendix C summarizes the notations for the proofs, and the useful lemmas are included in Appendix D. Appendix E provides the detailed proof of Theorem 2, which is the formal version of Theorem 1. Appendix F describes the details of the synthetic-data experiments in Section 4, together with several other experimental results omitted from the main text because of the limited space. The lower bound of the VC-dimension is proved in Appendix G, and Appendix G.1 provides the bound of α for some edge sampling strategies. Additional proofs of the useful lemmas are summarized in Appendix H. In addition, we provide a high-level idea for extending the framework in this paper to a multi-class classification problem in Appendix I.

K_+ denotes the set of neurons with positive coefficients in the linear layer. For a lucky neuron k ∈ K_+, the projection of the weights on p_+ strictly increases (see Lemma 2 in Appendix D). For the other neurons, which are named unlucky neurons, because the class-irrelevant patterns are identically distributed and independent of y_v, the gradients generated from D_+ and D_- are similar. Specifically, because of the offsets between D_+ and D_-, the overall gradient is in the order of √( (1+R²)/|D| ), where R is the degree of the graph. With a sufficiently large amount of training samples, the projection of the weights on the class-irrelevant patterns grows much more slowly than that on the class-relevant patterns. One can refer to Figure 11 for an illustration of the neuron weights update.

Important notations of scalars and matrices

The top 5 largest singular value (SV) of the collecting feature matrix for the classes

The cosine similarity among the classes

The cosine similarity among the classes

Figures 24 and 25 show the test errors on the Cora, Citeseer, and Pubmed datasets under different sampling and pruning rates, where darker colors denote lower errors. In all sub-figures, we observe that joint edge sampling and pruning can reduce the test error, which justifies the efficiency of joint edge-model sparsification. In Figure 24, by comparing the performance of joint sparsification on the same dataset but with different numbers of training samples, one can conclude that joint model-edge sparsification with a smaller number of training samples can achieve similar or even better performance than training without sparsification.



where the second equality comes from (51), and the last equality comes from the orthogonality of the features in P. Similar to (57), when α_{J'} < 0, a parallel argument applies to u_{J'}. Therefore, for each D_J, the output in (56) evaluates to Σ_{J' ∈ J \ {J†}} α_{J'} = y_J, which completes the proof.
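As a sanity check on the construction, the coefficient matrix in (54) — the n×n all-ones matrix J minus the identity — has the closed-form inverse J/(n-1) - I for n ≥ 2, so the system always has a unique solution. A pure-Python sketch (function name is ours) verifying this under our reconstruction:

```python
from fractions import Fraction

def solve_vc_system(y_dagger):
    """Solve M a = y, where M is the all-ones matrix minus the identity,
    using the closed form M^{-1} = J/(n-1) - I (valid for n >= 2)."""
    n = len(y_dagger)
    s = sum(y_dagger)
    return [Fraction(s, n - 1) - y for y in y_dagger]

# sanity check: applying M to the solution reproduces the labels,
# since (M a)_i = sum_j a_j - a_i
y = [1, 0, 1, 1]
a = solve_vc_system(y)
assert [sum(a) - a[i] for i in range(len(a))] == y
```

Exact rational arithmetic via `Fraction` avoids any floating-point concern in the check.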

G.1 THE IMPORTANCE SAMPLING PROBABILITY FOR VARIOUS SAMPLING STRATEGIES

In this part, we provide the bound of α for different sampling strategies. For simplicity, we consider a graph in which every node degree is R (R > r), and we sample a fixed number r of neighbor nodes for each node v ∈ V.

Lemma 9 provides the value of α for the uniform sampling strategy: α is in the order of r/R and depends on the average degree of the nodes containing class-relevant patterns. Formally, we have

Lemma 9. For the uniform sampling strategy such that all neighbor nodes are sampled with the same probability, Eα is determined by c, the degree of the nodes in D ∩ (V_+ ∪ V_-).
Corollary 9.1. For the uniform sampling strategy, when each node in V_N connects to exactly one node in V_+ ∪ V_-, we have Eα = r/R.
Corollary 9.2. For the uniform sampling strategy, Eα ≥ cr/R when r ≪ R.

Remark 9.1: Given α = r/R, (6) yields the same sample complexity bound for all r, which leads to no improvement from graph sampling. In addition, from (7), we can see that the required number of iterations increases by a factor of R/r, while the number of sampled edges in each iteration is reduced by a factor of r/R. Considering the computational resources used in other steps, we should expect an increased total computational time.

Remark 9.2: From Corollary 9.2, we can see that uniform sampling can save the sample complexity by a factor of 1/c² when r ≪ R on average. However, one realization of α may have a large variance when r is too small. In practice, a medium range of r is desired for both saving sample complexity and algorithm stability.

For FastGCN (Chen et al., 2018), each node is assigned a sampling probability. We consider a simplified version of such an importance sampling approach by dividing the training data D into two subsets D_I and D_U, where the sampling probabilities of the nodes in the same subset are identical.
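Corollary 9.1 can be verified combinatorially: when a node has exactly one class-relevant neighbor among its R neighbors, uniform sampling of r neighbors without replacement includes that neighbor with probability C(R-1, r-1)/C(R, r) = r/R. A small sketch (the function name is ours):

```python
from math import comb
from fractions import Fraction

def prob_relevant_sampled(R, r):
    """Probability that the unique class-relevant neighbor is among r
    neighbors sampled uniformly without replacement out of R."""
    # favorable draws fix the relevant neighbor and choose r-1 of the rest
    return Fraction(comb(R - 1, r - 1), comb(R, r))
```

For instance, with the Reddit-scale average degree R = 492 and r = 20 sampled edges, the class-relevant neighbor survives sampling with probability exactly 20/492.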
Let γ be the ratio of the sampling probabilities of the two groups. Lemma 10 describes the importance sampling strategy when the probability of sampling class-relevant nodes is higher than that of sampling class-irrelevant nodes by a factor of γ. From Corollary 10.1, we know that α can be improved over uniform sampling by a factor of γ.

The corresponding loss function in (2) can be revised accordingly. Then, we initialize the weights b_k as (1, 1), (1, -1), (-1, 1), and (-1, -1) with equal probability. Next, we divide the weights w_k into four groups based on the value of b_k. Hence, for any k in W_1, one can compute the corresponding updates for x ∈ D_1, D_2, D_3, and D_4 in turn. Comparing (147) with (29), one can derive that the weights w_k^{(t)} update mainly along the direction of p_1 but have bounded magnitudes in the other directions. By following the steps from (142) to (147), similar results can be derived for any k ∈ W_i with 1 ≤ i ≤ 4.

