ASGNN: GRAPH NEURAL NETWORKS WITH ADAPTIVE STRUCTURE

Abstract

Graph neural network (GNN) models have achieved impressive results in numerous machine learning tasks. However, many existing GNN models have been shown to be vulnerable to adversarial attacks, which creates a pressing need for robust GNN architectures. In this work, we propose a novel interpretable message passing scheme with adaptive structure (ASMP) to defend against adversarial attacks on the graph structure. The layers of ASMP are derived from the optimization steps of an objective function that learns the node features and the graph structure simultaneously. ASMP is adaptive in the sense that the message passing process in different layers can be carried out over dynamically adjusted graphs. This property allows more fine-grained handling of a noisy (or perturbed) graph structure and hence improves robustness. The convergence properties of the ASMP scheme are theoretically established. Integrating ASMP with neural networks leads to a new family of GNN models with adaptive structure (ASGNN). Extensive experiments on semi-supervised node classification tasks demonstrate that the proposed ASGNN outperforms state-of-the-art GNN architectures in terms of classification performance under various adversarial attacks.

1. INTRODUCTION

Graphs, or networks, are ubiquitous data structures in many fields of science and engineering (Newman, 2018), such as molecular biology, computer vision, social science, and financial technology. In the past few years, due to their appealing capability of learning representations through message passing over the graph structure, graph neural network (GNN) models have become popular choices for processing graph-structured data and have achieved astonishing success in various applications (Kipf and Welling, 2017;



the supervised GNN training loss (Zhu et al., 2022). For example, in Franceschi et al. (2019), the graph adjacency matrix is learned directly with a GNN via bilevel optimization, where a full parametrization of the graph adjacency matrix is adopted. Under this full parametrization setting, structural regularizers are further adopted in Jin et al. (2020); Luo et al. (2021) as augmentations of the training loss function to promote certain properties of the purified graph. Besides the full parametrization approach, a multi-head weighted cosine similarity metric function (Chen et al., 2020) and a GNN model (Yu et al., 2020) have also been used to parameterize the graph adjacency matrix for structure learning. Going beyond purifying the graph structure, there are also efforts to robustify GNN models by directly designing the feature aggregation schemes. Observing that aggregation functions such as the sum, weighted mean, or max can be arbitrarily distorted by a single outlier node, Geisler et al. (2020); Wang et al. (2020); Zhang and Lu (2020) design robust aggregation functions. Moreover, some works apply the attention mechanism (Veličković et al., 2018) to mitigate the influence of adversarial perturbations. For example, Zhu et al. (2019) model the node features as Gaussian distributions and use the variance information to determine the attention scores. Tang et al. (2020) use clean graphs and their adversarial counterparts to train an attention mechanism that learns to assign small attention scores to perturbed edges. In Zhang and Zitnik (2020), the authors define an attention mechanism based on the similarity of neighboring nodes. Different from existing approaches to robustify GNNs, in this work we propose a novel robust and interpretable message passing scheme with adaptive structure (ASMP).
Based on ASMP, a family of GNN models with adaptive structure (ASGNN) can be designed. Prior works have revealed that the message passing processes in a class of GNNs are actually (unrolled) gradient steps for solving a graph signal denoising (GSD) problem (Zhu et al., 2021; Ma et al., 2021; Zhang and Zhao, 2022). ASMP is generated by an alternating (proximal) gradient descent algorithm that simultaneously denoises the graph signal and the graph structure. Designed in such a principled way, ASMP is not only friendly to back-propagation training but also achieves the desired structure adaptivity with a theoretical convergence guarantee. Once trained, ASMP can be naturally interpreted as a parameter-optimized iterative algorithm. This work falls into the category of GNN architecture design. Conceptually different from existing robustified GNNs with a fixed graph structure, ASGNN interweaves the graph purification process and the message passing process, which makes it possible to conduct message passing over different graph structures at different layers, i.e., in an adaptive graph structure fashion. Thus, an edge might be excluded in some layers but included in others, depending on the dynamic structure learning process. This property allows more fine-grained handling of perturbations than existing graph purification methods that use a single graph in the entire GNN. To be more specific, the major contributions of this work are highlighted in the following.

• We propose a novel message passing scheme over graphs called ASMP with a convergence guarantee and specifications. To the best of our knowledge, ASMP is the first message passing scheme with adaptive structure designed based on an optimization problem.

• Based on ASMP, a family of GNN models with adaptive structure, named ASGNN, is further introduced. The adaptive structure in ASGNN allows more fine-grained handling of noisy graph structures and strengthens the model's robustness against adversarial attacks.

• Extensive experiments under various adversarial attack scenarios showcase the superiority of the proposed ASGNN. The numerical results corroborate that the adaptive structure property inherent in ASGNN helps mitigate the impact of a perturbed graph structure.

2. PRELIMINARIES AND BACKGROUND

An unweighted graph with self-loops is denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the node set and the edge set, respectively. The graph adjacency matrix is given by $A \in \mathbb{R}^{N \times N}$. We denote by $\mathbf{1}$ and $I$ the all-one column vector and the identity matrix, respectively. Given the diagonal degree matrix $D = \mathrm{Diag}(A\mathbf{1}) \in \mathbb{R}^{N \times N}$, the Laplacian matrix is defined as $L = D - A$. We denote by $A_{\mathrm{rw}} = D^{-1}A$ the random walk (or row-wise) normalized adjacency matrix and by $A_{\mathrm{sym}} = D^{-1/2}AD^{-1/2}$ the symmetric normalized adjacency matrix. Correspondingly, the random walk normalized and symmetric normalized Laplacian matrices are defined as $L_{\mathrm{rw}} = I - D^{-1}A$ and $L_{\mathrm{sym}} = I - D^{-1/2}AD^{-1/2}$, respectively. $X \in \mathbb{R}^{N \times M}$ ($M$ is the dimension of the node features) is a node feature matrix, or a graph signal, and its $i$-th row $X_{i,:}$ represents the feature vector of the $i$-th node, $i = 1, \dots, N$. $X_{ij}$ (or $[X]_{ij}$) denotes the $(i,j)$-th element of $X$, $i, j = 1, \dots, N$. For a vector $X_{i,:}$, $X_{i,:}^{-1}$ represents its element-wise inverse.
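As a quick sanity check of the notation above, the following NumPy sketch builds $D$, $A_{\mathrm{rw}}$, $A_{\mathrm{sym}}$, and the corresponding Laplacians; the 4-node adjacency matrix with self-loops is a hypothetical example of ours, not from the paper.

```python
import numpy as np

# A hypothetical 4-node undirected graph with self-loops.
A = np.array([[1., 1., 0., 0.],
              [1., 1., 1., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])

D = np.diag(A @ np.ones(A.shape[0]))          # D = Diag(A 1)
L = D - A                                     # combinatorial Laplacian L = D - A
D_inv = np.diag(1.0 / np.diag(D))
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))

A_rw = D_inv @ A                              # random walk normalization D^{-1} A
A_sym = D_inv_sqrt @ A @ D_inv_sqrt           # symmetric normalization D^{-1/2} A D^{-1/2}
L_rw = np.eye(4) - A_rw
L_sym = np.eye(4) - A_sym

# Rows of A_rw sum to one, and L_rw annihilates constant signals.
assert np.allclose(A_rw.sum(axis=1), 1.0)
assert np.allclose(L_rw @ np.ones(4), 0.0)
```

Note that $A_{\mathrm{sym}}$ is symmetric while $A_{\mathrm{rw}}$ generally is not, although both share the same spectrum up to a similarity transform.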

2.1. GNNS AS GRAPH SIGNAL DENOISING

In the literature (Yang et al., 2021; Pan et al., 2021; Zhu et al., 2021), it has been realized that the message passing layers for feature learning in many GNN models can be uniformly interpreted as gradient steps for minimizing certain energy functions, which carries the meaning of GSD (Ma et al., 2021). Recently, Zhang and Zhao (2022) further showed that some popular GNNs are neural networks induced from unrolling (proximal) gradient descent algorithms for solving specific GSD problems. Taking the approximate personalized propagation of neural predictions (APPNP) model (Klicpera et al., 2019) as an example, the initial node feature matrix $Z$ is first pre-processed by a multilayer perceptron $g_\theta(\cdot)$ with model parameters $\theta$, producing an output $X = g_\theta(Z)$; $X$ is then fed into a $K$-layer message passing scheme given by
$$H^{(0)} = X, \qquad H^{(k+1)} = (1-\alpha)A_{\mathrm{sym}}H^{(k)} + \alpha X, \quad k = 0, \dots, K-1, \qquad (1)$$
where $H^{(0)}$ denotes the input feature of the message passing process, $H^{(k)}$ represents the learned feature after the $k$-th layer, and $\alpha$ is the teleport probability. The message passing of an APPNP model is therefore fully specified by two quantities: a graph structure matrix $A_{\mathrm{sym}}$, which is assumed known beforehand, and a parameter $\alpha$, which is treated as a hyperparameter. From an optimization perspective, the message passing process in Eq. (1) can be seen as executing $K$ steps of gradient descent, with initialization $H^{(0)} = X$ and step size $0.5$ (Zhu et al., 2021; Ma et al., 2021; Zhang and Zhao, 2022), on the GSD problem
$$\underset{H\in\mathbb{R}^{N\times M}}{\text{minimize}}\quad \alpha\|H-X\|_F^2 + (1-\alpha)\,\mathrm{Tr}\big(H^\top L_{\mathrm{sym}}H\big), \qquad (2)$$
where $X$ and $\alpha$ are given and have the same meaning as in Eq. (1).
In Problem (2), the first term is a fidelity term forcing the recovered graph signal $H$ to be as close as possible to the noisy graph signal $X$, and the second term is the symmetric normalized Laplacian smoothing term measuring the variation of the graph signal $H$, which can be explicitly expressed as
$$\mathrm{Tr}\big(H^\top L_{\mathrm{sym}}H\big) = \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N A_{ij}\left\|\frac{H_{i,:}}{\sqrt{D_{ii}}} - \frac{H_{j,:}}{\sqrt{D_{jj}}}\right\|_2^2.$$
For more technical discussions on the relationships between GNNs and iterative optimization algorithms for solving GSD problems, please refer to Ma et al. (2021); Zhang and Zhao (2022). Apart from using the lens of optimization to interpret existing GNN models, there is also literature (Liu et al., 2021b; Chen et al., 2021; Fu et al., 2022) on building new GNN architectures by designing novel optimization problems and the corresponding iterative algorithms (more discussions are provided in Appendix A).
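The equivalence described above can be checked numerically. The following sketch (the random graph and features are hypothetical) runs $K$ APPNP propagation steps and $K$ gradient descent steps with step size $0.5$ on the objective of Problem (2), and confirms that the two iterations coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, alpha, K = 5, 3, 0.1, 10

# A hypothetical random undirected graph with self-loops.
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)
d = A.sum(axis=1)
A_sym = A / np.sqrt(np.outer(d, d))           # D^{-1/2} A D^{-1/2}
L_sym = np.eye(N) - A_sym

X = rng.standard_normal((N, M))

# APPNP propagation: H <- (1 - alpha) A_sym H + alpha X.
H_appnp = X.copy()
for _ in range(K):
    H_appnp = (1 - alpha) * A_sym @ H_appnp + alpha * X

# Gradient descent (step size 1/2) on
# alpha ||H - X||_F^2 + (1 - alpha) Tr(H^T L_sym H).
H_gd = X.copy()
for _ in range(K):
    grad = 2 * alpha * (H_gd - X) + 2 * (1 - alpha) * L_sym @ H_gd
    H_gd = H_gd - 0.5 * grad

assert np.allclose(H_appnp, H_gd)             # the two recursions coincide
```

Algebraically, one gradient step with step size $1/2$ gives $H - \alpha(H-X) - (1-\alpha)(I - A_{\mathrm{sym}})H = (1-\alpha)A_{\mathrm{sym}}H + \alpha X$, which is exactly one APPNP layer.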

2.2. GRAPH LEARNING WITH STRUCTURAL REGULARIZERS

Structural regularizers are commonly adopted to promote desirable properties when learning a graph (Kalofolias, 2016; Pu et al., 2021). In the following, we discuss several widely used graph structural regularizers that will be incorporated into the design of ASMP. We denote the learnable graph adjacency matrix by $S$ with $S \in \mathcal{S}$, where $\mathcal{S} = \{S \in \mathbb{R}^{N\times N} \mid 0 \le S_{ij} \le 1,\ i,j = 1,\dots,N\}$ defines the class of adjacency matrices. Under the assumption that node features change smoothly between adjacent nodes (Ortega et al., 2018), the Laplacian smoothing regularization term is commonly considered in graph structure learning. Eq. (3) is the symmetric normalized Laplacian smoothing term, and a random walk normalized alternative can be defined similarly by replacing $L_{\mathrm{sym}}$ in Eq. (3) with $L_{\mathrm{rw}}$. Real-world graphs are normally sparsely connected, which can be represented by sparse adjacency matrices. Moreover, the singular values of such adjacency matrices are commonly observed to be small (Zhou et al., 2013; Kumar et al., 2020). In contrast, a noisy adjacency matrix (e.g., one perturbed by adversarial attacks) tends to be dense and to gain singular values of larger magnitude (Jin et al., 2020). In view of this, graph structural regularizers promoting sparsity and/or suppressing the singular values are widely adopted in the graph learning literature (Kalofolias, 2016; Egilmez et al., 2017; Dong et al., 2019). Specifically, the $\ell_1$-norm of the adjacency matrix, defined as $\|S\|_1 = \sum_{i,j=1}^N |S_{ij}|$, is often used to promote sparsity. For penalizing the singular values, the $\ell_1$-norm and the $\ell_2$-norm of the singular value vector of the adjacency matrix $S$ can help. Equivalently, they translate to the nuclear norm and the Frobenius norm of $S$, given by $\|S\|_* = \sum_{i=1}^N \sigma_i(S)$ and $\|S\|_F = \sqrt{\sum_{i=1}^N \sigma_i^2(S)}$, respectively, where $\sigma_1(S) \ge \cdots \ge \sigma_N(S)$ denote the ordered singular values of $S$.
Both regularizers restrict the scale of the singular values, while the nuclear norm additionally promotes low-rankness. A recent study (Deng et al., 2022) points out that graph learning methods with low-rank-promoting regularizers may lose a wide range of the clean graph's spectrum corresponding to important structure in the spatial domain. Thus, the nuclear norm regularizer may impair the quality of the reconstructed graph and therefore limit the performance of GNNs. Besides, the nuclear norm is not amenable to back-propagation and incurs high computational complexity (Luo et al., 2021). Arguably, the Frobenius norm of $S$ is a more suitable regularizer for graph structure learning than the nuclear norm.
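The three regularizers discussed in this subsection are easy to compare numerically. A small sketch (the random symmetric $S$ is a hypothetical example) computes them and checks the standard orderings $\|S\|_F \le \|S\|_* \le \|S\|_1$:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.random((6, 6))
S = (S + S.T) / 2                              # a hypothetical weighted adjacency in [0, 1]

sigma = np.linalg.svd(S, compute_uv=False)     # singular values sigma_1 >= ... >= sigma_N

l1_norm = np.abs(S).sum()                      # ||S||_1  : promotes sparsity
nuclear = sigma.sum()                          # ||S||_*  : l1-norm of singular values
frob = np.sqrt((S ** 2).sum())                 # ||S||_F  : l2-norm of singular values

assert np.isclose(frob, np.sqrt((sigma ** 2).sum()))   # ||S||_F^2 = sum of sigma_i^2
assert frob <= nuclear <= l1_norm + 1e-9               # standard norm orderings
```

The first assertion verifies the equivalence $\|S\|_F = \sqrt{\sum_i \sigma_i^2(S)}$ stated above; the second shows why the Frobenius norm is the gentler of the two spectral penalties.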

3. THE PROPOSED GRAPH NEURAL NETWORKS

In this section, we first motivate a design principle based on joint node feature and graph structure learning and formulate the corresponding optimization problem. We then develop an efficient optimization algorithm for solving this problem, which leads to a novel message passing scheme with adaptive structure (ASMP). After that, we provide interpretations, convergence guarantees, and specifications of ASMP. Finally, integrating ASMP with deep neural networks yields a new family of GNNs with adaptive structure, named ASGNN.

3.1. A NOVEL DESIGN PRINCIPLE WITH ADAPTIVE GRAPH STRUCTURE

As discussed in Section 2.1, the message passing procedure in many popular GNNs can be viewed as performing graph signal denoising (or node feature learning) (Zhu et al., 2021; Ma et al., 2021; Pan et al., 2021; Zhang and Zhao, 2022) over a prefixed graph. Unfortunately, if some edges in the graph are task-irrelevant or even maliciously manipulated, the learned node features may not be appropriate for the downstream tasks. Motivated by this, we propose a new design principle for message passing: learning the node features and the graph structure simultaneously. It enables learning an adaptive graph structure from the features for the message passing procedure, and hence such a message passing scheme can potentially improve robustness against a noisy input graph structure. Specifically, we construct an optimization objective by augmenting the GSD objective in Eq. (2) (with a random walk normalized graph Laplacian smoothing term) with a structure fidelity term $\|S-A\|_F^2$, where $A$ is the given initial graph adjacency matrix, and the structural regularizers $\|S\|_1$ and $\|S\|_F^2$. We then obtain the following optimization problem:
$$\underset{H\in\mathbb{R}^{N\times M},\,S\in\mathcal{S}}{\text{minimize}}\quad p(H,S) = \underbrace{\|H-X\|_F^2 + \lambda\,\mathrm{Tr}\big(H^\top L_{\mathrm{rw}}H\big)}_{\text{feature learning}} + \underbrace{\gamma\|S-A\|_F^2 + \mu_1\|S\|_1 + \mu_2\|S\|_F^2}_{\text{structure learning}}, \qquad (4)$$
where $H$ is the feature variable, $S$ is the structure variable, and $\gamma$, $\lambda$, $\mu_1$, and $\mu_2$ are parameters balancing the different terms. To enable the interplay between feature learning and structure learning, the Laplacian smoothing term is defined through $S$ rather than $A$, i.e., $L_{\mathrm{rw}} = I - D^{-1}S$ with $D = \mathrm{Diag}(S\mathbf{1})$. When adversarial attacks exist, a perturbed adjacency matrix $A$ is observed. Since attacks are generally designed to be unnoticeable (Jin et al., 2021), the perturbed graph adjacency matrix largely remains similar to the original one in value. In view of this, we also include the structure fidelity term $\|S-A\|_F^2$. The motivation for introducing the last two regularizers has been elaborated in Section 2.2.
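The objective $p(H,S)$ of Problem (4) can be sketched in a few lines of NumPy (the function name `asgnn_objective` and the toy instance are ours for illustration). Note that $L_{\mathrm{rw}}$ is rebuilt from the structure variable $S$, not from $A$:

```python
import numpy as np

def asgnn_objective(H, S, X, A, lam, gamma, mu1, mu2):
    """Objective p(H, S) of Problem (4): feature fidelity + Laplacian
    smoothing over the *learned* structure S, plus structure fidelity
    and sparsity/Frobenius regularizers. A minimal NumPy sketch."""
    n = S.shape[0]
    D_inv = np.diag(1.0 / (S @ np.ones(n)))
    L_rw = np.eye(n) - D_inv @ S                 # L_rw built from S, not A
    feature = np.sum((H - X) ** 2) + lam * np.trace(H.T @ L_rw @ H)
    structure = (gamma * np.sum((S - A) ** 2)
                 + mu1 * np.abs(S).sum()
                 + mu2 * np.sum(S ** 2))
    return feature + structure

# Hypothetical toy instance: with S = A and a constant H = X, the fidelity
# and smoothing terms vanish and only the regularizers on A remain:
# 0.1 * ||A||_1 + 0.1 * ||A||_F^2 = 0.1 * 7 + 0.1 * 7 = 1.4.
A = np.array([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
X = np.ones((3, 2))
val = asgnn_objective(X, A, X, A, lam=1.0, gamma=1.0, mu1=0.1, mu2=0.1)
```

The smoothing term vanishes here because $L_{\mathrm{rw}}\mathbf{1} = 0$, i.e., constant signals are perfectly smooth on any graph.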

3.2. ASMP: MESSAGE PASSING WITH ADAPTIVE STRUCTURE

Following the idea that the message passing of a GNN model can be derived from the optimization of a GSD objective function (Ma et al., 2021; Zhang and Zhao, 2022), we can obtain a message passing scheme from Problem (4). Different from the existing GSD problems for GNN model design, which involve only the feature variable, Problem (4) is nonconvex and much more challenging. To obtain an efficient iterative algorithm that is friendly to back-propagation training, we propose to use the alternating (proximal) gradient descent method (Parikh and Boyd, 2014), i.e., alternatingly optimizing one variable by taking one (proximal) gradient step at a time with the other variable fixed. (Note that a joint optimization approach is also eligible, but it would lead to slower convergence than the alternating optimization approach; more details can be found in Appendix D.) We denote by $H^{(k)}$ and $S^{(k)}$ the variables at the $k$-th iteration ($k = 0, \dots, K$). In the following, the update rules for $H$ and $S$ are discussed in turn.

Updating the node feature matrix $H$: Given $\{H^{(k)}, S^{(k)}\}$, the subproblem with respect to the feature matrix $H$ is
$$\underset{H\in\mathbb{R}^{N\times M}}{\text{minimize}}\quad \|H-X\|_F^2 + \lambda\,\mathrm{Tr}\big(H^\top L_{\mathrm{rw}}^{(k)}H\big),$$
where $L_{\mathrm{rw}}^{(k)} = I - \mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}S^{(k)}$. One gradient step for $H$ is computed as
$$\begin{aligned}
H^{(k+1)} &= H^{(k)} - \eta_1\Big(2H^{(k)} - 2X + 2\lambda L_{\mathrm{rw}}^{(k)}H^{(k)}\Big)\\
&= H^{(k)} - \eta_1\Big(2H^{(k)} - 2X + 2\lambda\big(I - \mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}S^{(k)}\big)H^{(k)}\Big)\\
&= (1-2\eta_1-2\eta_1\lambda)H^{(k)} + 2\eta_1\lambda\,\mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}S^{(k)}H^{(k)} + 2\eta_1 X,
\end{aligned}$$
where $\eta_1$ denotes the step size.

Updating the graph structure matrix $S$: Given $\{H^{(k+1)}, S^{(k)}\}$ and noting that $\mathrm{Tr}\big(H^{(k+1)\top}L_{\mathrm{rw}}H^{(k+1)}\big) = \mathrm{Tr}\big(H^{(k+1)\top}H^{(k+1)}\big) - \mathrm{Tr}\big(H^{(k+1)\top}\mathrm{Diag}(S\mathbf{1})^{-1}SH^{(k+1)}\big)$, the subproblem for $S$ becomes
$$\underset{S\in\mathcal{S}}{\text{minimize}}\quad \gamma\|S-A\|_F^2 - \lambda\,\mathrm{Tr}\big(H^{(k+1)\top}\mathrm{Diag}(S\mathbf{1})^{-1}SH^{(k+1)}\big) + \mu_1\|S\|_1 + \mu_2\|S\|_F^2. \qquad (6)$$
Due to the non-smoothness of the objective function, we apply one step of proximal gradient descent (Parikh and Boyd, 2014) to this problem.
Define
$$T^{(k)} = (2\gamma+2\mu_2)S^{(k)} - 2\gamma A - \lambda\,\mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}H^{(k+1)}\big(H^{(k+1)}\big)^\top + \lambda\,\mathrm{Diag}\Big(\mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}S^{(k)}H^{(k+1)}\big(H^{(k+1)}\big)^\top\mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}\Big)\mathbf{1}\mathbf{1}^\top.$$
One step of proximal gradient descent is then given as follows (details are given in Appendix B):
$$S^{(k+1)} = \mathrm{prox}_{\eta_2(\mu_1\|\cdot\|_1+\mathcal{I}_{\mathcal{S}}(\cdot))}\Big(S^{(k)} - \eta_2 T^{(k)}\Big), \qquad (7)$$
where $\eta_2$ is the step size and $\mathcal{I}_{\mathcal{S}}(S)$ denotes the indicator function taking value $0$ if $S \in \mathcal{S}$ and $+\infty$ otherwise. Moreover, the proximal operator in Eq. (7) can be computed analytically as
$$S^{(k+1)} = \min\Big\{1, \mathrm{ReLU}\Big(S^{(k)} - \eta_2 T^{(k)} - \eta_2\mu_1\mathbf{1}\mathbf{1}^\top\Big)\Big\},$$
where $\mathrm{ReLU}(X) = \max\{0, X\}$ and the $\min$ is taken element-wise. In conclusion, the overall procedure of ASMP is summarized as follows:
$$\begin{aligned}
H^{(k+1)} &= (1-2\eta_1-2\eta_1\lambda)H^{(k)} + 2\eta_1\lambda\,\mathrm{Diag}\big(S^{(k)}\mathbf{1}\big)^{-1}S^{(k)}H^{(k)} + 2\eta_1 X,\\
S^{(k+1)} &= \min\Big\{1, \mathrm{ReLU}\Big(S^{(k)} - \eta_2 T^{(k)} - \eta_2\mu_1\mathbf{1}\mathbf{1}^\top\Big)\Big\},
\end{aligned} \qquad k = 0, \dots, K-1. \qquad \text{(ASMP)}$$
ASMP can be interpreted as standard message passing (i.e., the update of $H$) with extra operations that adaptively adjust the graph structure (i.e., the update of $S$). Therefore, an edge included in some layers may be excluded or down-weighted in other layers. A pictorial illustration of the ASMP procedure is provided in Figure 1. A $K$-layer ASMP is fully specified by the parameters $\gamma$, $\lambda$, $\mu_1$, $\mu_2$, $\eta_1$, and $\eta_2$, which we denote generically as $\mathrm{ASMP}_K(X, A, \gamma, \lambda, \mu_1, \mu_2, \eta_1, \eta_2)$. Note that ASMP is general enough to cover several existing propagation rules as special cases.

Remark 1 (Special cases). If we use a fixed graph structure $S^{(0)} = \cdots = S^{(K)} = A$ in ASMP, i.e., $\mu_1 = \mu_2 = \gamma = 0$, ASMP reduces to a classical message passing procedure that only performs feature learning. Specifically, with $\eta_1 = \frac{1}{2+2\lambda}$ and the symmetric normalized adjacency matrix, ASMP can be written as
$$H^{(k+1)} = \frac{\lambda}{1+\lambda}A_{\mathrm{sym}}H^{(k)} + \frac{1}{1+\lambda}X. \qquad (8)$$
Case I: when $\lambda = \frac{1}{\alpha}-1$, Eq. (8) becomes the message passing rule of APPNP (Klicpera et al., 2019): $H^{(k+1)} = (1-\alpha)A_{\mathrm{sym}}H^{(k)} + \alpha X$. Case II: when $\lambda = \infty$, Eq. (8) becomes the simple aggregation used in many GNN models, such as the GCN model (Kipf and Welling, 2017) and the simple graph convolution (SGC) model (Wu et al., 2019a): $H^{(k+1)} = A_{\mathrm{sym}}H^{(k)}$.

Instead of updating both $S$ and $H$ once per layer, we can also choose to update each for several steps. The convergence of ASMP is guaranteed with proper selections of the step sizes, as demonstrated in Theorem 4. Before proceeding to the convergence result, we first introduce some standard assumptions on the node feature vectors and the degree matrices, which are widely adopted in the literature (Garg et al., 2020; Liao et al., 2021; Cong et al., 2021).

Assumption 2. The energy of the node features is uniformly upper bounded, i.e., $\big\|H_{i,:}^{(k)}\big\|_2 \le B$ for $i = 1, \dots, N$ and $k = 0, \dots, K$.

Assumption 3. The diagonal elements of the degree matrix are lower bounded by a positive constant, i.e., $\min_i D_{ii} = c > 0$.

Theorem 4. Let $H^{(0)} = X$ and $S^{(0)} = A$. Under Assumptions 2 and 3, the sequence $\{H^{(k)}, S^{(k)}\}_{k=1}^K$ generated by (ASMP) with $0 < \eta_1 < \frac{1}{1+2\lambda}$ and $0 < \eta_2 < \frac{1}{\gamma+\mu_2+\left(1+\frac{1}{c}N\sqrt{N}\right)\frac{\lambda}{c^2}N^2B^2}$ converges to a first-order stationary point of Problem (4), denoted $\{H^\star, S^\star\}$, with rate
$$\inf_{k\ge K}\ \big\|H^{(k+1)}-H^{(k)}\big\|_F^2 + \big\|S^{(k+1)}-S^{(k)}\big\|_F^2 \le \frac{1}{\rho K}\Big(p\big(H^{(0)}, S^{(0)}\big) - p\big(H^\star, S^\star\big)\Big),$$
where $\rho$ is a constant depending on the step sizes and Lipschitz constants, and $p(H,S)$ represents the objective of Problem (4).

Proof. The proof of Theorem 4 is in Appendix C. Note that if multiple update steps are used for $S$ and $H$ in ASMP, this convergence result still holds (Bolte et al., 2014; Nikolova and Tan, 2017).
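The two ASMP update rules can be sketched in a few lines of NumPy (the function `asmp` and the toy setup are ours; this is a sketch of the recursion, not the authors' implementation). Setting $\eta_2 = 0$ freezes $S = A$, which recovers the fixed-structure special case of Remark 1 (here with random walk rather than symmetric normalization):

```python
import numpy as np

def asmp(X, A, K, gamma, lam, mu1, mu2, eta1, eta2):
    """K layers of ASMP: a gradient step on H alternating with a
    proximal gradient step on S (NumPy sketch)."""
    H, S = X.copy(), A.copy()
    n = len(A)
    for _ in range(K):
        D_inv = 1.0 / (S @ np.ones(n))                    # Diag(S 1)^{-1} as a vector
        # --- feature update: gradient step on H ---
        H = ((1 - 2 * eta1 - 2 * eta1 * lam) * H
             + 2 * eta1 * lam * (D_inv[:, None] * (S @ H))
             + 2 * eta1 * X)
        # --- structure update: proximal gradient step on S ---
        G = H @ H.T
        M = D_inv[:, None] * (S @ G) * D_inv[None, :]     # D^{-1} S H H^T D^{-1}
        T = ((2 * gamma + 2 * mu2) * S - 2 * gamma * A
             - lam * D_inv[:, None] * G
             + lam * np.outer(np.diag(M), np.ones(n)))
        S = np.clip(S - eta2 * T - eta2 * mu1, 0.0, 1.0)  # min{1, ReLU(.)}
    return H, S

# Sanity check: with eta2 = 0 the structure stays at S = A, and with
# eta1 = 1/(2 + 2*lam) the H-update reduces to the fixed-structure rule
# H <- lam/(1+lam) * A_rw H + 1/(1+lam) * X (random walk analogue of Eq. (8)).
A = np.array([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
X = np.arange(6.0).reshape(3, 2)
H, S = asmp(X, A, K=3, gamma=0.0, lam=1.0, mu1=0.0, mu2=0.0, eta1=0.25, eta2=0.0)

H_ref, A_rw = X.copy(), A / A.sum(axis=1, keepdims=True)
for _ in range(3):
    H_ref = 0.5 * A_rw @ H_ref + 0.5 * X
assert np.allclose(H, H_ref) and np.allclose(S, A)
```

Nonzero $\eta_2$, $\gamma$, $\mu_1$, and $\mu_2$ activate the adaptive-structure branch, so each layer of message passing runs over a slightly different $S^{(k)}$.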

3.3. ASGNN: GRAPH NEURAL NETWORKS WITH ADAPTIVE STRUCTURE

In this section, we introduce a family of GNNs leveraging the ASMP scheme. Integrating (ASMP) with a machine learning model $g_\theta(\cdot)$ (e.g., a multilayer perceptron) with $H^{(0)} = X = g_\theta(Z)$, a $K$-layer ASGNN model is defined as
$$H^{(K)} = \mathrm{ASMP}_K\big(H^{(0)}, A, \gamma, \lambda, \mu_1, \mu_2, \eta_1, \eta_2\big).$$
In ASGNN, we adopt a decoupled architecture similar to APPNP (Klicpera et al., 2019) and the deep adaptive GNN (DAGNN) (Liu et al., 2021a). Specifically, the model $g_\theta$ first transforms the initial node features as $X = g_\theta(Z)$, and ASMP then performs $K$ steps of message passing with input $g_\theta(Z)$. The parameters of ASMP, i.e., $\gamma$, $\lambda$, $\mu_1$, and $\mu_2$, are treated as weights to be learned from the downstream task. For example, in semi-supervised node classification tasks, the loss function is the cross-entropy classification loss on the labeled nodes, and the whole model is trained end-to-end. Since ASMP is derived from an alternating (proximal) gradient descent algorithm, a trained ASMP is naturally a parameter-optimized iterative algorithm. The step sizes $\eta_1$ and $\eta_2$ in ASMP can be chosen according to Theorem 4. However, such choices tend to be too conservative in practice and may lead to slow convergence, so we may also treat $\eta_1$ and $\eta_2$ as learnable parameters. The convergence of ASMP with learned step sizes is showcased in the experiments. In total, six parameters of ASMP are thus considered during the learning process. In this paper, we focus on problems with a given initial graph structure, but the use of ASGNN can also be extended to scenarios where an initial structure is not available. In such cases, we can first create a k-nearest neighbor graph or use optimization methods (Dong et al., 2016; Kalofolias, 2016; Kumar et al., 2020) to learn a graph structure from the node features. Such extensions of ASGNN are promising future research directions.
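The decoupled architecture can be sketched as follows. To keep the snippet self-contained, we stub the ASMP stage with the fixed-structure propagation of Remark 1 (random walk normalized); all function names, sizes, and weights are hypothetical:

```python
import numpy as np

def mlp(Z, W1, W2):
    # g_theta: a minimal two-layer perceptron with ReLU (hypothetical weights).
    return np.maximum(Z @ W1, 0.0) @ W2

def propagate(X, A, K, lam=1.0):
    # Stand-in for ASMP with the structure frozen at S = A (Remark 1):
    # H <- lam/(1+lam) * A_rw H + 1/(1+lam) * X.
    A_rw = A / A.sum(axis=1, keepdims=True)
    H = X.copy()
    for _ in range(K):
        H = (lam / (1 + lam)) * A_rw @ H + (1 / (1 + lam)) * X
    return H

def asgnn_forward(Z, A, W1, W2, K=4):
    X = mlp(Z, W1, W2)            # feature transformation: H^(0) = X = g_theta(Z)
    return propagate(X, A, K)     # then K layers of message passing

# Hypothetical toy graph and weights.
rng = np.random.default_rng(3)
N, F_in, F_hid, C = 6, 8, 16, 3
Z = rng.standard_normal((N, F_in))
A = np.eye(N)
A[0, 1] = A[1, 0] = 1.0           # tiny graph with self-loops
W1 = rng.standard_normal((F_in, F_hid))
W2 = rng.standard_normal((F_hid, C))
H = asgnn_forward(Z, A, W1, W2)
```

In the full model, `propagate` would be replaced by the ASMP recursion with learnable $\gamma$, $\lambda$, $\mu_1$, $\mu_2$, $\eta_1$, $\eta_2$, and the whole pipeline would be trained end-to-end with cross-entropy loss on the labeled nodes.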

4. EXPERIMENTS

In this section, we conduct experiments to validate the effectiveness of the proposed ASGNN model. First, we introduce the experimental settings. Then, we assess the performance of ASGNN on semi-supervised node classification tasks and investigate the benefits of introducing adaptive structure into GNNs against global and targeted attacks. Finally, we analyze the structure denoising ability and the convergence of ASMP with learned step sizes.

Datasets:

We perform numerical experiments on 4 real-world citation graphs, i.e., Cora (Sen et al., 2008), Citeseer (Sen et al., 2008), Cora-ML (Bojchevski and Günnemann, 2018), and ACM (Wang et al., 2019), and only consider the largest connected component of each dataset.

Baselines:

To evaluate the effectiveness of ASGNN, we compare it with GCN and several benchmarks designed from different perspectives to robustify GNNs: GCN-Jaccard (Wu et al., 2019b), which pre-processes the graph by eliminating edges with low Jaccard similarity of node feature vectors; GCN-SVD (Entezari et al., 2020), which applies a low-rank approximation of the given graph adjacency matrix; GNNGuard (Zhang and Zitnik, 2020), which defines an attention mechanism based on the similarity of neighboring nodes; Pro-GNN (Jin et al., 2020), which jointly learns a graph structure and a GNN model guided by predefined structural priors; and Elastic GNN (Liu et al., 2021b), which utilizes trend filtering instead of Laplacian smoothing to promote robustness. The code is implemented based on PyTorch Geometric (Fey and Lenssen, 2019). For GCN-Jaccard, GCN-SVD, and Pro-GNN, we use the implementations provided in DeepRobust (Li et al., 2020). For GNNGuard and Elastic GNN, we follow the implementations provided in the original papers (Zhang and Zitnik, 2020; Liu et al., 2021b).

Parameter settings:

For all experimental results, we report the average performance and standard deviation over 10 independent trials. For each graph, we randomly select 10%/10%/80% of the nodes for training, validation, and testing, respectively. The Adam optimizer is used in all experiments. The models' hyperparameters are tuned based on validation performance. The hyperparameter search space is as follows: 1) learning rate: {0.005, 0.01, 0.05}; 2) weight decay: {0, 5e-5, 5e-4}; 3) dropout rate: {0.1, 0.5, 0.8}; 4) model depth: {2, 4, 8, 16}.
For GCN-Jaccard, the threshold of Jaccard similarity for removing dissimilar edges is chosen from {0.01, 0.02, 0.03, 0.04, 0.05, 0.1}. For GCN-SVD, the reduced rank of the graph is tuned from {5, 10, 15, 50, 100, 200}. For Elastic GNN, the regularization coefficients are chosen from {3, 6, 9}. For Pro-GNN, we adopt the hyperparameters provided in their paper (Jin et al., 2020) .

4.2. PERFORMANCE UNDER ADVERSARIAL ATTACK

The performance of the compared models is evaluated under training-time adversarial attacks (Wang and Gong, 2019; Zügner and Günnemann, 2019), i.e., the graph is first attacked and the GNN models are then trained on the perturbed graph. In the following, we conduct experiments under both global and targeted attacks. Specifically, a global attack aims to reduce the overall performance of GNNs (Zügner and Günnemann, 2019), while a targeted attack aims to fool GNNs on specific nodes (Zügner et al., 2018).

Global Attack: We first test the node classification performance of ASGNN and the baselines under a representative global attack method called meta-attack (Zügner and Günnemann, 2019). We vary the perturbation rate, i.e., the ratio of changed edges, from 0% to 25% in steps of 5%. The results are reported in Table 1. From the table, we observe that the proposed ASGNN model outperforms the other methods in most cases. For instance, ASGNN improves over GCN by more than 30% on the Cora-ML dataset at a 20% perturbation rate and by more than 20% on the Cora dataset at a 25% perturbation rate. On the Cora, Citeseer, and ACM datasets, ASGNN beats the other baselines at various perturbation rates by a large margin. GCN-Jaccard and GNNGuard slightly outperform ASGNN on the Cora-ML dataset at 15%-25% perturbation rates, but they perform poorly on the other datasets: under the 25% perturbation rate, ASGNN outperforms GCN-Jaccard on the other three datasets by 22%, 10%, and 10%, respectively. These results demonstrate that ASGNN resists global attacks better than the baseline methods.

Targeted Attack: For the targeted attack, we use a representative method called NETTACK (Zügner et al., 2018). Following existing works (Zhu et al., 2019; Jin et al., 2020), we vary the number of perturbations made on every node, i.e., the number of edge removals/additions, from 0 to 5 in steps of 1.
The results are reported in Table 2. We choose the nodes in the test set with degrees larger than 10 as target nodes, and the reported classification performance is evaluated on these target nodes. Thus, the results in Table 2 reflect performance on the target nodes.

4.3. ROBUSTNESS OF ASGNN

The message passing scheme in ASGNN is designed based on the principle of joint node feature and graph structure learning. To validate that ASGNN helps purify (i.e., denoise) the structure, we evaluate the quality of the graphs generated by ASGNN via the performance of a GCN model trained on the graph produced in the last layer of ASGNN (which we name GCN-AS). Under the assumption that a GCN trained on a purer graph gives better performance, the performance of GCN-AS indicates the quality of the graph learned in ASGNN. We conduct experiments on different datasets under targeted attack with five perturbations per node; the results are in Table 3. The performance of GCN-AS is much better than that of GCN. Moreover, GCN-AS even outperforms the other defense models on some datasets, e.g., it outperforms all other baselines on the ACM dataset. These results indicate that ASGNN can mitigate the influence of adversarial attacks. We provide additional experiments in Appendix E, including a runtime analysis in Appendix E.1, a discussion of the learned coefficients in ASGNN in Appendix E.2, the sparsity level of the graphs generated in ASGNN in Appendix E.3, and a convergence analysis of ASMP with learned step sizes in Appendix E.4.

5. CONCLUSION

In this work, we have developed an interpretable robust message passing scheme named ASMP following the principle of joint node feature and graph structure learning. ASMP is provably convergent and has a clear interpretation as a standard message passing scheme with adaptive structure. Integrating ASMP with neural network components yields a family of robust graph neural networks with adaptive structure. Extensive experiments on real-world datasets under various adversarial attack settings corroborate the effectiveness and robustness of the proposed graph neural network architecture.

A RELATED WORK ON OPTIMIZATION-INDUCED GRAPH NEURAL NETWORK DESIGN

Since the ASGNN proposed in this paper is induced from an optimization algorithm, in this section we provide a broader review of optimization-induced GNN model design to supplement our discussion. The idea of optimization-induced GNN model design partly stems from the observation that many primitive handcrafted GNN models can be nicely interpreted as (unrolled) iterative algorithms for solving a GSD optimization problem (Ma et al., 2021; Zhu et al., 2021; Zhang and Zhao, 2022). Based on this observation, many papers aim at strengthening the capability of GNNs by carefully designing the underlying optimization problems and/or the iterative algorithms solving them. For example, inspired by the idea of trend filtering (Wang et al., 2015), Liu et al. (2021b) replace the Laplacian smoothing term (which takes the form of an $\ell_2$-norm) in the GSD problem with an $\ell_{2,1}$-norm to promote robustness against abnormal edges. Also for robustness, Yang et al. (2021) replace the Laplacian smoothing term with nonlinear functions imposed on pairwise node distances. Since the classical Laplacian smoothing term in GSD only promotes smoothness over connected nodes, the authors in Zhang et al. (2020); Zhao and Akoglu (2020) further suggest promoting non-smoothness over disconnected nodes, achieved by deducting the sum of distances between disconnected node pairs from the denoising objective. In Jiang et al. (2022), the authors augment the GSD objective with a fairness term to counteract large topology bias. Most recently, Fu et al. (2022) propose a p-Laplacian message passing scheme and a pGNN model, which is capable of dealing with heterophilic graphs and is robust to adversarial perturbations. In addition, Ahn et al. (2022) design a novel regularization term to build heterogeneous GNNs. Although there is rich literature on optimization-induced GNN model design, all of these works focus on learning the node feature matrix.
Our work is similar to these in terms of design philosophy; however, we design an objective that jointly learns the node feature and the graph structure, which has rarely been covered in the literature.

B DERIVATION OF THE PROXIMAL GRADIENT STEP IN EQ. (7)

For the S-block optimization, i.e., Eq. (6), we define the objective function without the term $\mu_1\|S\|_1$ as $f_S(S)$, i.e.,
$$f_S(S) = \gamma\|S-A\|_F^2 - \lambda\,\mathrm{Tr}\big(H^\top D^{-1}SH\big) + \mu_2\|S\|_F^2, \qquad (9)$$
where $D = \mathrm{Diag}(S\mathbf{1})$. In this section, we first derive the expression of $\nabla f_S(S)$ and then compute the proximal operator in Eq. (7).

B.1 ON COMPUTATION OF $\nabla f_S(S)$

We first focus on the gradient computation of the second term in Eq. (9). For the graph degree matrix, we have $D = \mathrm{Diag}(S\mathbf{1}) = (S\mathbf{1}\mathbf{1}^\top) \circ I$, where $\circ$ denotes the Hadamard product. Based on the rules of matrix calculus, the differential of the scalar function $\operatorname{Tr}(H^\top D^{-1} S H)$ with respect to the matrix variable $S$ can be computed as follows:
$$\mathrm{d}\operatorname{Tr}\big(H^\top D^{-1} S H\big) = \operatorname{Tr}\big(H^\top D^{-1}\,\mathrm{d}(S)\,H + H^\top\,\mathrm{d}(D^{-1})\,S H\big).$$
For an invertible $D$ (note that, in this paper, all the graphs considered have self-loops, so $D$ is always invertible), we have $\mathrm{d}(D^{-1}) = -D^{-1}\,\mathrm{d}(D)\,D^{-1}$. Thus, we can get
$$\begin{aligned}
\mathrm{d}\operatorname{Tr}\big(H^\top D^{-1} S H\big) &= \operatorname{Tr}\Big(H^\top D^{-1}\,\mathrm{d}(S)\,H - H^\top D^{-1}\big((\mathrm{d}(S)\mathbf{1}\mathbf{1}^\top)\circ I\big)D^{-1} S H\Big)\\
&= \operatorname{Tr}\big(H H^\top D^{-1}\,\mathrm{d}(S)\big) - \operatorname{Tr}\Big(\big((D^{-1} S H H^\top D^{-1})\circ I\big)\,\mathrm{d}(S)\,\mathbf{1}\mathbf{1}^\top\Big)\\
&= \operatorname{Tr}\Big(\big(H H^\top D^{-1} - \mathbf{1}\mathbf{1}^\top\mathrm{Diag}\big(D^{-1} S H H^\top D^{-1}\big)\big)\,\mathrm{d}(S)\Big),
\end{aligned}$$
where, with a slight abuse of notation, $\mathrm{Diag}(M)$ of a square matrix $M$ denotes the diagonal matrix keeping only the diagonal entries of $M$. Since $\mathrm{d}\operatorname{Tr}(H^\top D^{-1} S H) = \operatorname{Tr}\big(\big(\tfrac{\mathrm{d}\operatorname{Tr}(H^\top D^{-1} S H)}{\mathrm{d}S}\big)^\top\mathrm{d}(S)\big)$, we have
$$\frac{\mathrm{d}\operatorname{Tr}\big(H^\top D^{-1} S H\big)}{\mathrm{d}S} = D^{-1} H H^\top - \mathrm{Diag}\big(D^{-1} S H H^\top D^{-1}\big)\mathbf{1}\mathbf{1}^\top.$$
For the other terms in $f_S(S)$, the gradients with respect to $S$ can be computed easily. Finally, we obtain
$$\nabla f_S(S) = 2\gamma(S - A) - \lambda\Big(D^{-1} H H^\top - \mathrm{Diag}\big(D^{-1} S H H^\top D^{-1}\big)\mathbf{1}\mathbf{1}^\top\Big) + 2\mu_2 S.$$

B.2 ON COMPUTATION OF THE PROXIMAL STEP

Lemma 5. Given a matrix $M \in \mathbb{R}^{N\times N}$, we have
$$\operatorname{prox}_{\kappa\|\cdot\|_1 + \mathcal{I}_{\mathcal{S}}(\cdot)}(M) = \min\big\{1, \mathrm{ReLU}\big(M - \kappa\mathbf{1}\mathbf{1}^\top\big)\big\},$$
where $\mathrm{ReLU}(X) = \max\{0, X\}$ and both $\min$ and $\max$ are taken elementwise.

Proof. The proximal step in Eq. (10) can be rewritten as the following optimization problem:
$$\underset{S \in \mathcal{S}}{\text{minimize}}\quad \tfrac{1}{2}\|S - M\|_F^2 + \kappa\|S\|_1. \tag{11}$$
It is easy to observe that Problem (11) decouples over the elements of $S$. Therefore, each $S_{ij}$ with $i, j = 1, \ldots, N$ can be optimized individually by solving
$$\underset{0 \le S_{ij} \le 1}{\text{minimize}}\quad h_{ij}(S_{ij}) = \tfrac{1}{2}\big(S_{ij} - M_{ij}\big)^2 + \kappa|S_{ij}|. \tag{12}$$
According to Eq. (10), we have
$$S_{ij}^\star = \min\{1, \mathrm{ReLU}(M_{ij} - \kappa)\} = \begin{cases} 1 & 1 + \kappa \le M_{ij}\\ M_{ij} - \kappa & \kappa \le M_{ij} < 1 + \kappa\\ 0 & M_{ij} < \kappa.\end{cases}$$
Then, Lemma 5 can be proved by showing that, for $i, j = 1, \ldots, N$, $S_{ij}^\star$ is the optimal solution of Problem (12). The optimality of $S_{ij}^\star$ can be validated by verifying the optimality condition, i.e., that there exists a subgradient $\psi \in \partial h_{ij}(S_{ij}^\star)$ such that $\psi(S_{ij} - S_{ij}^\star) \ge 0$ for all $0 \le S_{ij} \le 1$.
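Before the analytic verification, the closed form in Lemma 5 can be sanity-checked numerically against a brute-force minimization of the elementwise problem (12). The following NumPy sketch uses hypothetical values of $\kappa$ and $M$:

```python
import numpy as np

rng = np.random.default_rng(1)
kappa = 0.3                      # hypothetical threshold value
M = rng.uniform(-1.0, 2.0, size=(5, 5))

# closed form of Lemma 5: soft-threshold by kappa, then clip to [0, 1]
S_closed = np.minimum(1.0, np.maximum(0.0, M - kappa))

# brute force: minimize h_ij(s) = 0.5 (s - M_ij)^2 + kappa |s| over a grid of [0, 1]
grid = np.linspace(0.0, 1.0, 10001)
S_brute = np.empty_like(M)
for i in range(5):
    for j in range(5):
        h = 0.5 * (grid - M[i, j]) ** 2 + kappa * np.abs(grid)
        S_brute[i, j] = grid[np.argmin(h)]

print(np.max(np.abs(S_closed - S_brute)))  # agreement up to the grid resolution
```

The two solutions coincide up to the grid spacing, which matches the case analysis in the proof.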
Observe that the subdifferential of $h_{ij}(S_{ij})$ is computed as follows:
$$\partial h_{ij}(S_{ij}) = \begin{cases} S_{ij} - M_{ij} + \kappa & S_{ij} > 0\\ S_{ij} - M_{ij} + \kappa\epsilon & S_{ij} = 0,\end{cases}$$
where $\epsilon$ can be any constant satisfying $-1 \le \epsilon \le 1$. Then, the subdifferential $\partial h_{ij}(S_{ij}^\star)$ is given by
$$\partial h_{ij}(S_{ij}^\star) = \begin{cases} 1 - M_{ij} + \kappa & 1 + \kappa \le M_{ij}\\ 0 & \kappa \le M_{ij} < 1 + \kappa\\ -M_{ij} + \kappa\epsilon & M_{ij} < \kappa.\end{cases}$$
In the following, we show that the optimality condition holds in each of the above cases.

1. For $1 + \kappa \le M_{ij}$, we have $S_{ij}^\star = 1$ and $\psi = 1 - M_{ij} + \kappa \le 0$. Since $S_{ij} \le 1$, we get $\psi(S_{ij} - S_{ij}^\star) \ge 0$.
2. For $\kappa \le M_{ij} < 1 + \kappa$, we have $\psi = 0$ and hence $\psi(S_{ij} - S_{ij}^\star) = 0$ for all $0 \le S_{ij} \le 1$.
3. For $M_{ij} < \kappa$, we have $S_{ij}^\star = 0$ and $\psi = -M_{ij} + \kappa\epsilon$ with $\epsilon$ any constant satisfying $-1 \le \epsilon \le 1$. Choosing $\epsilon = 1$ leads to $\psi > 0$. Since $S_{ij} \ge 0$, we get $\psi(S_{ij} - S_{ij}^\star) \ge 0$.

To derive the Lipschitz constant of $\nabla f_S(S)$, we first present several useful lemmas.

Lemma 9. Under Assumption 2 that the norms of the node feature vectors are upper bounded, i.e., $\|H_{i,:}\|_2 \le B$, we have
$$\|HH^\top\|_2 \le \|HH^\top\|_F = \sqrt{\sum_{i=1}^N\sum_{j=1}^N \big(H_{i,:}H_{j,:}^\top\big)^2} \le \sqrt{\sum_{i=1}^N\sum_{j=1}^N B^4} = NB^2.$$

Lemma 10. Given $S \in \mathcal{S}$ and $D = \mathrm{Diag}(S\mathbf{1})$, under Assumption 3 that the diagonal elements of $D$ are lower bounded by a positive constant, i.e., $\min_i D_{ii} = c > 0$ for $i = 1, \ldots, N$, we have
$$\|D^{-1}S\|_2 \le \|D^{-1}S\|_F = \sqrt{\sum_{i=1}^N\sum_{j=1}^N \Big(\frac{S_{ij}}{D_{ii}}\Big)^2} \le \frac{N}{c}.$$

Lemma 11. Given $S_1, S_2 \in \mathcal{S}$, $D_1 = \mathrm{Diag}(S_1\mathbf{1})$, and $D_2 = \mathrm{Diag}(S_2\mathbf{1})$, under Assumption 3 that the diagonal elements of the degree matrices are lower bounded by a positive constant, i.e., $\min_i D_{ii} = c > 0$ for $i = 1, \ldots, N$, we have
$$\|D_1^{-1} - D_2^{-1}\|_F \le \frac{N}{c^2}\|S_1 - S_2\|_F, \tag{14}$$
and
$$\|D_1^{-1}S_1 - D_2^{-1}S_2\|_2 \le \Big(\frac{1}{c} + \frac{N^2}{c^2}\Big)\|S_1 - S_2\|_F. \tag{15}$$

Proof. It can be observed that
$$\|D_1^{-1} - D_2^{-1}\|_F = \sqrt{\sum_{i=1}^N\Big(\frac{[D_1]_{ii} - [D_2]_{ii}}{[D_1]_{ii}[D_2]_{ii}}\Big)^2} \le \frac{1}{c^2}\sqrt{\sum_{i=1}^N\Big(N\max_{j=1,\ldots,N}\big|[S_1]_{ij} - [S_2]_{ij}\big|\Big)^2} \le \frac{N}{c^2}\|S_1 - S_2\|_F,$$
which proves Eq. (14). Based on Eq.
(14), we further have
$$\begin{aligned}
\|D_1^{-1}S_1 - D_2^{-1}S_2\|_2 &\le \|D_1^{-1}S_1 - D_2^{-1}S_2\|_F\\
&\le \|D_1^{-1}S_1 - D_1^{-1}S_2\|_F + \|D_1^{-1}S_2 - D_2^{-1}S_2\|_F\\
&\le \|D_1^{-1}\|_2\|S_1 - S_2\|_F + \|S_2\|_2\|D_1^{-1} - D_2^{-1}\|_F\\
&\le \|D_1^{-1}\|_2\|S_1 - S_2\|_F + N\|D_1^{-1} - D_2^{-1}\|_F\\
&\le \Big(\frac{1}{c} + \frac{N^2}{c^2}\Big)\|S_1 - S_2\|_F,
\end{aligned}$$
through which the proof is completed.

Based on Lemma 9 and Lemma 11, the second term in Eq. (13), i.e., $\lambda\|HH^\top\|_2\|D_1^{-1} - D_2^{-1}\|_F$, can be upper bounded as follows:
$$\lambda\|HH^\top\|_2\|D_1^{-1} - D_2^{-1}\|_F \le \frac{\lambda}{c^2}N^2B^2\|S_1 - S_2\|_F. \tag{16}$$
For the third term in Eq. (13), we have
$$\begin{aligned}
&\lambda\sqrt{N}\,\big\|D_1^{-1}S_1HH^\top D_1^{-1} - D_2^{-1}S_2HH^\top D_2^{-1}\big\|_F\\
&= \lambda\sqrt{N}\,\big\|D_1^{-1}S_1HH^\top D_1^{-1} - D_1^{-1}S_1HH^\top D_2^{-1} + D_1^{-1}S_1HH^\top D_2^{-1} - D_2^{-1}S_2HH^\top D_2^{-1}\big\|_F\\
&\le \lambda\sqrt{N}\,\big\|D_1^{-1}S_1HH^\top\big(D_1^{-1} - D_2^{-1}\big)\big\|_F + \lambda\sqrt{N}\,\big\|\big(D_1^{-1}S_1 - D_2^{-1}S_2\big)HH^\top D_2^{-1}\big\|_F\\
&\le \lambda\sqrt{N}\,\|D_1^{-1}S_1\|_2\|HH^\top\|_F\|D_1^{-1} - D_2^{-1}\|_2 + \lambda\sqrt{N}\,\|D_1^{-1}S_1 - D_2^{-1}S_2\|_2\|HH^\top\|_F\|D_2^{-1}\|_2.
\end{aligned}$$
Based on Lemma 9, Lemma 10, and Lemma 11, we can get the following result:
$$\lambda\sqrt{N}\,\big\|D_1^{-1}S_1HH^\top D_1^{-1} - D_2^{-1}S_2HH^\top D_2^{-1}\big\|_F \le \Big(1 + \frac{2N^2}{c}\Big)\frac{\lambda}{c^2}N\sqrt{N}B^2\|S_1 - S_2\|_F. \tag{17}$$
Substituting the results in Eq. (16) and Eq. (17) into Eq. (13) gives
$$\|\nabla f_S(S_1) - \nabla f_S(S_2)\|_F \le \Big(2\gamma + 2\mu_2 + \frac{\lambda}{c^2}N^2B^2 + \Big(1 + \frac{2N^2}{c}\Big)\frac{\lambda}{c^2}N\sqrt{N}B^2\Big)\|S_1 - S_2\|_F \le \Big(2\gamma + 2\mu_2 + \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{2\lambda}{c^2}N^2B^2\Big)\|S_1 - S_2\|_F.$$
Therefore, the function $f_S(S)$ is $L$-smooth with
$$L_S = 2\gamma + 2\mu_2 + \frac{2\lambda}{c^2}N^2B^2 + \frac{2\lambda}{c^3}N^3\sqrt{N}B^2,$$
and the proof is completed.

Based on the results in Lemma 6 and Lemma 8, we can conclude that $f_H(H)$ and $f_S(S)$ are both $L$-smooth. To ensure the monotonically decreasing property of (ASMP), the step sizes must satisfy (Parikh and Boyd, 2014):
$$0 < \eta_1 < \frac{2}{L_H} = \frac{1}{1 + 2\lambda} \quad\text{and}\quad 0 < \eta_2 < \frac{2}{L_S} = \frac{1}{\gamma + \mu_2 + \big(1 + \frac{1}{c}N\sqrt{N}\big)\frac{\lambda}{c^2}N^2B^2}.$$
Under such conditions, the convergence of (ASMP) to a first-order stationary point of Problem (4) can be readily obtained based on the results for the alternating proximal gradient descent method in Bolte et al.
(2014); Nikolova and Tan (2017), with convergence rate
$$\min_{0 \le k \le K}\ \|H^{(k+1)} - H^{(k)}\|_F^2 + \|S^{(k+1)} - S^{(k)}\|_F^2 \le \frac{1}{\rho K}\Big(p\big(H^{(0)}, S^{(0)}\big) - p\big(H^\star, S^\star\big)\Big), \tag{18}$$
where $\rho = \min\big\{\frac{1}{\eta_1} - \frac{L_H}{2}, \frac{1}{\eta_2} - \frac{L_S}{2}\big\}$ and $p(H, S)$ denotes the objective function in Eq. (4).
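To make the alternating scheme concrete, the following NumPy sketch runs ASMP-style iterations — a gradient step on $f_H$ followed by a proximal gradient step on $f_S$ using the prox of Lemma 5 — on random data and checks that the objective of Problem (4) decreases monotonically. All sizes, coefficients, and step sizes here are hypothetical (the paper treats the coefficients and step sizes as learnable); the step sizes are simply chosen small enough to satisfy the descent conditions above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 4
# hypothetical coefficients and step sizes
gamma, lam, mu1, mu2 = 1.0, 0.5, 0.01, 0.1
eta1, eta2 = 1e-4, 1e-4
A = (rng.random((N, N)) < 0.3).astype(float)
np.fill_diagonal(A, 1.0)                     # self-loops keep D invertible
X = rng.random((N, d))

def objective(H, S):
    D_inv = np.diag(1.0 / S.sum(axis=1))
    L_rw = np.eye(N) - D_inv @ S
    return (np.sum((H - X) ** 2) + gamma * np.sum((S - A) ** 2)
            + lam * np.trace(H.T @ L_rw @ H)
            + mu2 * np.sum(S ** 2) + mu1 * np.sum(np.abs(S)))

def asmp_step(H, S):
    D_inv = np.diag(1.0 / S.sum(axis=1))
    L_rw = np.eye(N) - D_inv @ S
    # H-step: gradient descent on f_H(H) = ||H - X||_F^2 + lam Tr(H^T L_rw H)
    H = H - eta1 * (2.0 * (H - X) + lam * (L_rw + L_rw.T) @ H)
    # S-step: proximal gradient on f_S(S) + mu1 ||S||_1, gradient from Appendix B.1
    M = D_inv @ S @ H @ H.T @ D_inv
    grad = (2.0 * gamma * (S - A)
            - lam * (D_inv @ H @ H.T - np.outer(np.diag(M), np.ones(N)))
            + 2.0 * mu2 * S)
    # prox of Lemma 5: soft-threshold by eta2 * mu1, then project onto [0, 1]
    S = np.minimum(1.0, np.maximum(0.0, S - eta2 * grad - eta2 * mu1))
    return H, S

H, S = X.copy(), A.copy()
vals = [objective(H, S)]
for _ in range(20):
    H, S = asmp_step(H, S)
    vals.append(objective(H, S))
assert all(vals[k + 1] <= vals[k] + 1e-9 for k in range(20))
print("objective: %.4f -> %.4f" % (vals[0], vals[-1]))
```

The monotone decrease mirrors the sufficient-descent condition of Theorem 4; larger step sizes would require checking the bounds involving $L_H$ and $L_S$.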

D DISCUSSION ON THE JOINT OPTIMIZATION APPROACH

In this paper, we have used an alternating optimization approach to induce the ASMP scheme, while another natural idea is to apply a joint optimization approach to Problem (4). In this section, we show that the joint optimization approach is actually inferior to the alternating one, since joint optimization leads to slower convergence; this motivates the use of alternating optimization in ASMP. We define the smooth part of the objective in Problem (4) as
$$f(H, S) = \|H - X\|_F^2 + \gamma\|S - A\|_F^2 + \lambda\operatorname{Tr}\big(H^\top L_{\mathrm{rw}}H\big) + \mu_2\|S\|_F^2.$$
The $L$-smoothness of $f(H, S)$ is demonstrated in the following lemma.

Lemma 12. The function $f(H, S)$ is $L$-smooth with
$$L = \max\Big\{\sqrt{L_H^2 + \big(1 + \tfrac{1}{c}N\sqrt{N}\big)^2\tfrac{4\lambda^2}{c^2}NB^2},\ \sqrt{L_S^2 + \big(1 + \tfrac{1}{c}N^2\big)^2\tfrac{4\lambda^2}{c^2}NB^2}\Big\},$$
i.e., for any $H_1, H_2 \in \mathbb{R}^{N\times M}$ and $S_1, S_2 \in \mathbb{R}^{N\times N}$, the following inequality holds:
$$\|\nabla f(H_1, S_1) - \nabla f(H_2, S_2)\|_F \le L\sqrt{\|H_1 - H_2\|_F^2 + \|S_1 - S_2\|_F^2}.$$

Proof. Observe that
$$\begin{aligned}
&\left\|\begin{bmatrix}\nabla_H f(H_1, S_1)\\ \nabla_S f(H_1, S_1)\end{bmatrix} - \begin{bmatrix}\nabla_H f(H_2, S_2)\\ \nabla_S f(H_2, S_2)\end{bmatrix}\right\|_F^2\\
&= \|\nabla_H f(H_1, S_1) - \nabla_H f(H_2, S_2)\|_F^2 + \|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_2)\|_F^2\\
&= \|\nabla_H f(H_1, S_1) - \nabla_H f(H_1, S_2) + \nabla_H f(H_1, S_2) - \nabla_H f(H_2, S_2)\|_F^2\\
&\quad + \|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_1) + \nabla_S f(H_2, S_1) - \nabla_S f(H_2, S_2)\|_F^2\\
&\le L_H^2\|H_1 - H_2\|_F^2 + L_S^2\|S_1 - S_2\|_F^2 + \|\nabla_H f(H_1, S_1) - \nabla_H f(H_1, S_2)\|_F^2 + \|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_1)\|_F^2, \tag{19}
\end{aligned}$$
where the Lipschitz constants $L_H$ and $L_S$ are given in Lemma 6 and Lemma 8. In the following, we derive upper bounds for the third and the fourth terms in Eq. (19). Based on Lemma 11, the term $\|\nabla_H f(H_1, S_1) - \nabla_H f(H_1, S_2)\|_F$ can be upper bounded as follows:
$$\begin{aligned}
\|\nabla_H f(H_1, S_1) - \nabla_H f(H_1, S_2)\|_F &= \big\|2\lambda\big(I - D_1^{-1}S_1\big)H_1 - 2\lambda\big(I - D_2^{-1}S_2\big)H_1\big\|_F \tag{20}\\
&\le 2\lambda\|D_1^{-1}S_1 - D_2^{-1}S_2\|_2\|H_1\|_F \le \Big(1 + \frac{1}{c}N^2\Big)\frac{2\lambda}{c}\sqrt{N}B\|S_1 - S_2\|_F. \tag{21}
\end{aligned}$$
Besides, for the fourth term, we have
$$\begin{aligned}
\|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_1)\|_F &\le \big\|\lambda D_1^{-1}\big(H_1H_1^\top - H_2H_2^\top\big) - \lambda\mathrm{Diag}\big(D_1^{-1}S_1H_1H_1^\top D_1^{-1} - D_1^{-1}S_1H_2H_2^\top D_1^{-1}\big)\mathbf{1}\mathbf{1}^\top\big\|_F\\
&\le \lambda\|D_1^{-1}\|_2\|H_1H_1^\top - H_2H_2^\top\|_F + \lambda\sqrt{N}\|D_1^{-1}S_1\|_2\|H_1H_1^\top - H_2H_2^\top\|_F\|D_1^{-1}\|_2.
\end{aligned}$$
According to Lemma 10, we have
$$\|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_1)\|_F \le \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{\lambda}{c}\|H_1H_1^\top - H_2H_2^\top\|_F.$$
Then, we can get
$$\begin{aligned}
\|\nabla_S f(H_1, S_1) - \nabla_S f(H_2, S_1)\|_F &\le \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{\lambda}{c}\|H_1H_1^\top - H_1H_2^\top + H_1H_2^\top - H_2H_2^\top\|_F\\
&\le \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{\lambda}{c}\big(\|H_1\|_2\|H_1 - H_2\|_F + \|H_1 - H_2\|_F\|H_2\|_2\big)\\
&\le \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{\lambda}{c}\big(\|H_1\|_F + \|H_2\|_F\big)\|H_1 - H_2\|_F\\
&\le \Big(1 + \frac{1}{c}N\sqrt{N}\Big)\frac{2\lambda}{c}\sqrt{N}B\|H_1 - H_2\|_F. \tag{22}
\end{aligned}$$
Substituting the results in Eq. (21) and Eq. (22) into Eq. (19) gives
$$\left\|\begin{bmatrix}\nabla_H f(H_1, S_1)\\ \nabla_S f(H_1, S_1)\end{bmatrix} - \begin{bmatrix}\nabla_H f(H_2, S_2)\\ \nabla_S f(H_2, S_2)\end{bmatrix}\right\|_F^2 \le \Big(L_H^2 + \Big(1 + \frac{1}{c}N\sqrt{N}\Big)^2\frac{4\lambda^2}{c^2}NB^2\Big)\|H_1 - H_2\|_F^2 + \Big(L_S^2 + \Big(1 + \frac{1}{c}N^2\Big)^2\frac{4\lambda^2}{c^2}NB^2\Big)\|S_1 - S_2\|_F^2. \tag{23}$$
Thus, the function $f(H, S)$ is $L$-smooth with
$$L = \max\Big\{\sqrt{L_H^2 + \big(1 + \tfrac{1}{c}N\sqrt{N}\big)^2\tfrac{4\lambda^2}{c^2}NB^2},\ \sqrt{L_S^2 + \big(1 + \tfrac{1}{c}N^2\big)^2\tfrac{4\lambda^2}{c^2}NB^2}\Big\},$$
through which the proof is completed.

The result in Lemma 12 indicates that the Lipschitz constant $L$ is larger than both $L_H$ and $L_S$. Practical graphs are commonly large, i.e., the number of nodes $N$ dominates the other constants in $L$ ($c$, $\lambda$, and $B$). Therefore, if we use the joint optimization approach, the Lipschitz constant is larger than $L_H$ and $L_S$ by a large margin. After deriving the Lipschitz constant for joint optimization, we compare its convergence rate with that of the alternating optimization approach. Denote by $\{H^{(k)}, S^{(k)}\}_{k=0}^K$ the sequence generated by the joint optimization approach. Following the results in Bolte et al. (2014); Nikolova and Tan (2017), the convergence property of the joint optimization approach is also given by Eq. (18) with $\rho = \frac{1}{\eta} - \frac{L}{2}$. Theoretically, to guarantee sufficient descent of the objective at each step, $\rho$ can be chosen as $\min\big\{\frac{1}{\eta_1} - \frac{L_H}{2}, \frac{1}{\eta_2} - \frac{L_S}{2}\big\}$ in the alternating optimization approach.
Since $L > \max\{L_H, L_S\}$ according to Eq. (23), the alternating optimization approach can adopt a larger step size for each block than the joint optimization approach, which results in faster convergence of the generated sequence. Motivated by this fact, we develop ASMP based on the alternating procedure rather than the joint one, so that the resulting message passing scheme requires fewer layers to achieve similar or even better numerical performance than the joint one.
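The step-size argument can be illustrated on a toy problem with two blocks of very different smoothness (a hypothetical separable quadratic, not the GSD objective itself): joint gradient descent must use one step compatible with the larger block constant, while alternating updates can use per-block steps.

```python
# f(x, y) = 0.5 a x^2 + 0.5 b y^2 with block Lipschitz constants L_x = a, L_y = b
a, b = 100.0, 1.0
L_joint = a                      # joint descent must respect max(a, b)

x_j, y_j = 1.0, 1.0              # joint updates with the common step 1 / L_joint
x_a, y_a = 1.0, 1.0              # alternating updates with per-block steps 1/a, 1/b
for _ in range(50):
    x_j, y_j = x_j - (1 / L_joint) * a * x_j, y_j - (1 / L_joint) * b * y_j
    x_a = x_a - (1 / a) * a * x_a          # x-block step
    y_a = y_a - (1 / b) * b * y_a          # y-block step

print(abs(x_j) + abs(y_j), abs(x_a) + abs(y_a))
```

The alternating iterates reach the optimum immediately, while the slow block of the joint iterates contracts only by a factor $1 - b/a$ per step; this is the qualitative gap the Lipschitz comparison predicts.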

E ADDITIONAL EXPERIMENTS E.1 RUNTIME ANALYSIS

The computational complexity of ASMP is of order $O(N^2d)$, where $N$ is the number of nodes and $d$ is the feature dimension. This is larger than the complexity of simple graph convolution, which is of order $O(Ed)$ with $E$ being the number of edges. To better understand the complexity of ASGNN relative to the baselines, we provide a runtime comparison of different models. Specifically, we train the models for 200 epochs on the Cora dataset under Nettack with one perturbation per target node; the runtimes averaged over ten trials are shown in the following table. From the table, we can see that although ASGNN requires more computation time than GCN, GCN-Jaccard, and GCN-SVD, it is much faster than Pro-GNN, and its runtime is comparable to those of GNNGuard and Elastic GNN.
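The complexity gap comes from propagating over a dense learned $S$ versus a sparse $A$. A minimal sketch (hypothetical sizes; assumes SciPy is available) showing that both formats compute the same aggregation while the sparse product costs $O(Ed)$ and the dense one $O(N^2d)$:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N, d, E = 1000, 16, 5000
rows = rng.integers(0, N, E)
cols = rng.integers(0, N, E)
data = rng.random(E)
A_sparse = sp.csr_matrix((data, (rows, cols)), shape=(N, N))
A_dense = A_sparse.toarray()
H = rng.random((N, d))

# simple graph convolution costs O(E d) in the sparse format ...
out_sparse = A_sparse @ H
# ... while propagating over a dense learned S, as in ASMP, costs O(N^2 d)
out_dense = A_dense @ H
print(np.allclose(out_sparse, out_dense))
```

Both products agree; the difference is purely in the number of multiply-adds, which is why ASGNN is slower than sparse baselines such as GCN but in the same regime as other dense-propagation defenses.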

E.2 THE LEARNED COEFFICIENTS IN ASGNN

To better understand how the different terms in the joint node feature and graph structure learning objective contribute, we report the learned ASGNN coefficients on the Cora dataset under targeted attack below. From the table, we can see that the value of $\lambda$ is much larger than the other coefficients. Besides, the values of the coefficients do not vary much across different perturbation numbers, indicating that the relative importance of the different terms is not affected by the number of perturbations.

E.3 SPARSITY LEVEL OF THE LEARNED GRAPHS IN ASGNN

The sparsity levels of the graphs (percentage of zero elements in the learned adjacency matrix) generated in the last layer of ASGNN are provided below. Note that the sparsity level can be tuned by constraining the learning space of $\mu_1$; for example, fixing $\mu_1$ to a large value leads to sparser graphs. According to Theorem 4, the convergence of ASMP is guaranteed with proper choices of step sizes. Since the step sizes are set to be learnable parameters, we provide an additional experiment to empirically evaluate the convergence of ASMP with learned step sizes. To this end, we conduct experiments on the Cora, Citeseer, and Cora-ML datasets at a 25% perturbation rate under meta-attack. Specifically, we train a 4-layer ASGNN model and observe that the learned step sizes do not satisfy the condition in Theorem 4. We then investigate the empirical convergence behavior of ASMP. Since we use a recurrent structure in ASGNN, i.e., the step sizes used in different layers are shared, we can extend the trained 4-layer ASGNN model to a deeper one. The values of the objective function of Problem (4) across layers are shown in Figure 2, where the objective values are normalized by the objective value at the first layer. From Figure 2, we find that ASMP with learned step sizes monotonically decreases the objective function value during the message passing process. Although the monotonic decrease does not hold between layers 16-18 on the Cora-ML dataset, this is likely because the step sizes are learned based only on a 4-layer model. The results indicate that although the learned step sizes do not satisfy the condition in Theorem 4, they still ensure a monotonic decrease of the objective function value in practice.
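The effect of $\mu_1$ on sparsity discussed above can be seen directly from the proximal step of Lemma 5, $\min\{1, \mathrm{ReLU}(M - \kappa\mathbf{1}\mathbf{1}^\top)\}$ with $\kappa = \eta_2\mu_1$: a larger threshold zeroes more entries of the learned graph. A small NumPy sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((100, 100))   # stand-in for S - eta2 * grad_f_S(S)

def prox(M, kappa):
    """Prox of Lemma 5: soft-threshold by kappa, then project onto [0, 1]."""
    return np.minimum(1.0, np.maximum(0.0, M - kappa))

for kappa in (0.1, 0.5, 0.9):
    frac_zero = np.mean(prox(M, kappa) == 0)
    print(kappa, frac_zero)   # fraction of zeros grows with kappa
```

For uniform entries of $M$, the fraction of zeroed entries is roughly $\kappa$ itself, which is why fixing $\mu_1$ to a large value yields sparser graphs.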



Figure 1: Illustration of ASMP. ASMP takes the node feature matrix X and the graph structure matrix A as input. Different colors of the features indicate different embedding values; different widths of the edges indicate different weight values. ASMP updates the node features and the graph structure in an alternating manner.

Figure 2: The objective value in Problem (4) during ASMP.

Table: Node classification performance (accuracy ± std) under global attack (Bold: the best model; wavy: the runner-up model). (Table entries not recoverable from the extraction.)

Table: Node classification performance (accuracy ± std) under targeted attack (Bold: the best model; wavy: the runner-up model). (Table entries not recoverable from the extraction.)

Table: Performance comparison of GCN and GCN-AS. (Table entries not recoverable from the extraction.)

Runtime (in seconds) for 200 epochs of training.

Learned coefficients of ASGNN on Cora under targeted attack

Table: Sparsity level of the graphs learnt in ASGNN (%). (Table entries not recoverable from the extraction.)

E.4 CONVERGENCE PROPERTY OF ASMP IN PRACTICE


In conclusion, there exists a subgradient $\psi \in \partial h_{ij}(S_{ij}^\star)$ such that $\psi(S_{ij} - S_{ij}^\star) \ge 0$ for all $0 \le S_{ij} \le 1$, based on which the optimality in Eq. (10) is validated and the proof is completed. Based on the result in Lemma 5, by choosing $M = S^{(k)} - \eta_2\nabla f_S(S^{(k)})$ and $\kappa = \eta_2\mu_1$, we obtain the analytical expression for the proximal step in Eq. (7), which suffices to perform the soft-thresholding operation and then project the solution onto the constraint set $\mathcal{S}$.

C PROOF OF THEOREM 4 (CONVERGENCE OF ASMP)

In this section, we first prove that the objective function of the H-block optimization problem and the smooth part of the objective function of the S-block optimization problem are $L$-smooth. Then we give the conditions that ensure the convergence of ASMP. Denote by $f_H(H)$ the objective function of the H-block optimization problem, i.e., $f_H(H) = \|H - X\|_F^2 + \lambda\operatorname{Tr}(H^\top L_{\mathrm{rw}}H)$.

Lemma 7 (Chung, 1997). The largest eigenvalue of a random walk normalized Laplacian matrix $L_{\mathrm{rw}}$ is less than or equal to 2, i.e., $\|L_{\mathrm{rw}}\|_2 \le 2$.

Based on Lemma 7, we can conclude that $f_H(H)$ is $L$-smooth with $L_H = 2 + 4\lambda$, and the proof is completed. With $f_S$ defined in Eq. (9), the $L$-smoothness of $f_S$ is demonstrated in the following lemma.

Proof. Denote by $D_1 = \mathrm{Diag}(S_1\mathbf{1})$ and $D_2 = \mathrm{Diag}(S_2\mathbf{1})$ the two degree matrices corresponding to $S_1$ and $S_2$. We have
$$\|\nabla f_S(S_1) - \nabla f_S(S_2)\|_F \le (2\gamma + 2\mu_2)\|S_1 - S_2\|_F + \lambda\|HH^\top\|_2\|D_1^{-1} - D_2^{-1}\|_F + \lambda\sqrt{N}\,\big\|D_1^{-1}S_1HH^\top D_1^{-1} - D_2^{-1}S_2HH^\top D_2^{-1}\big\|_F. \tag{13}$$

