FLAG: ADVERSARIAL DATA AUGMENTATION FOR GRAPH NEURAL NETWORKS

Abstract

Data augmentation helps neural networks generalize better, but it remains an open question how to effectively augment graph data to enhance the performance of Graph Neural Networks (GNNs). While most existing graph regularizers focus on augmenting graph topological structures by adding or removing edges, we offer a novel direction: augmentation in the input node feature space. We propose a simple but effective solution, FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training and boosts performance at test time. Empirically, FLAG can be implemented in about a dozen lines of code and is flexible enough to work with any GNN backbone, on a wide variety of large-scale datasets, and in both transductive and inductive settings. Without modifying a model's architecture or training setup, FLAG yields a consistent and salient performance boost on both node and graph classification tasks. Using FLAG, we reach state-of-the-art performance on the large-scale ogbg-molpcba, ogbg-ppa, and ogbg-code datasets.

1. INTRODUCTION

Graph Neural Networks (GNNs) have emerged as powerful architectures for learning and analyzing graph representations. The Graph Convolutional Network (GCN) (Kipf & Welling, 2016) and its variants have been applied to a wide range of tasks, including visual recognition (Zhao et al., 2019; Shen et al., 2018), meta-learning (Garcia & Bruna, 2017), social analysis (Qiu et al., 2018; Li & Goldwasser, 2019), and recommender systems (Ying et al., 2018). However, training GNNs on large-scale datasets usually suffers from overfitting, and realistic graph datasets often involve a high volume of out-of-distribution test nodes (Hu et al., 2020), posing significant challenges for prediction problems.

One promising solution to combat overfitting in deep neural networks is data augmentation (Krizhevsky et al., 2012), which is commonplace in computer vision tasks. Data augmentations apply label-preserving transformations to images, such as translations and reflections, effectively enlarging the training set at negligible computational overhead. However, it remains an open problem how to effectively generalize the notion of data augmentation to GNNs. Transformations on images rely heavily on image structure, and it is challenging to design low-cost transformations that preserve semantic meaning for non-visual tasks like natural language processing (Wei & Zou, 2019) and graph learning. Generally speaking, graph data for machine learning comes with graph structure (or edge features) and node features. In the limited cases where data augmentation has been done on graphs, it focuses almost exclusively on the graph structure, by adding or removing edges (Rong et al., 2019). To date, there is no study on how to manipulate graphs in node feature space for enhanced performance.
In the meantime, adversarial data augmentation, which operates in the input feature space, is known to boost neural network robustness and promote resistance to adversarially chosen inputs (Goodfellow et al., 2014; Madry et al., 2017). Despite the wide belief that adversarial training harms standard generalization and leads to worse accuracy (Tsipras et al., 2018; Balaji et al., 2019), a growing amount of attention has recently been paid to using adversarial perturbations to augment datasets and ultimately alleviate overfitting. For example, Volpi et al. (2018) showed that adversarial data augmentation is a data-dependent regularization that can help generalize to out-of-distribution samples, and its effectiveness has been verified in domains including computer vision (Xie et al., 2020), language understanding (Zhu et al., 2019; Jiang et al., 2019), and visual question answering (Gan et al., 2020). Despite the rich literature on adversarial training of GNNs for security purposes (Zügner et al., 2018; Dai et al., 2018; Bojchevski & Günnemann, 2019; Zhang & Zitnik, 2020), it remains unclear how to effectively and efficiently improve a GNN's clean accuracy using adversarial augmentation.

Present work. We propose FLAG, Free Large-scale Adversarial Augmentation on Graphs, to tackle the overfitting problem. While existing literature focuses on modifying graph structures to augment datasets, FLAG works purely in the node feature space, adding gradient-based adversarial perturbations to the input node features while leaving graph structures unchanged. FLAG leverages "free" methods (Shafahi et al., 2019) to conduct efficient adversarial training, making it highly scalable on large-scale datasets. We verify the effectiveness of FLAG on the Open Graph Benchmark (OGB) (Hu et al., 2020), a collection of large-scale, realistic, and diverse graph datasets for both node and graph property prediction tasks.
We conduct extensive experiments across OGB datasets by applying FLAG to prominent GNN models, namely GCN, GraphSAGE, GAT, and GIN (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017; Xu et al., 2019), and show that FLAG brings consistent and significant improvements. For example, FLAG lifts the test accuracy of GAT on ogbn-products by an absolute 2.31%. DeeperGCN (Li et al., 2020) is another strong baseline that achieves top performance on several OGB benchmarks. FLAG enables DeeperGCN to generalize further and reach new state-of-the-art performance on ogbg-molpcba and ogbg-ppa. FLAG is simple (adding just a dozen lines of code), general (directly applicable to any GNN model), versatile (works in both transductive and inductive settings), and efficient (brings salient improvement at tractable or even no extra cost). Our main contributions are summarized as follows:

• We propose adversarial perturbations as a data augmentation in the input node feature space to efficiently boost GNN performance. The resulting FLAG framework is a scalable and flexible augmentation scheme for GNNs, easy to implement and applicable to any GNN architecture for both node and graph classification tasks.
• We advance the state-of-the-art on a number of large-scale OGB datasets, often by large margins.
• We provide a detailed analysis of, and insights into, the effects adversarial augmentation has on GNNs.

2. PRELIMINARIES

Graph Neural Networks (GNNs). We denote a graph as G(V, E) with initial node features x_v for v ∈ V and edge features e_uv for (u, v) ∈ E. GNNs are built on graph structures to learn representation vectors h_v for every node v ∈ V and a vector h_G for the entire graph G. The k-th iteration of message passing, i.e. the k-th layer of GNN forward computation, is:

    h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)({ (h_v^(k-1), h_u^(k-1), e_uv) : u ∈ N(v) }) ),    (1)

where h_v^(k) is the embedding of node v at the k-th layer, e_uv is the feature vector of the edge between nodes u and v, N(v) is node v's neighbor set, and h_v^(0) = x_v. COMBINE(·) and AGGREGATE(·) are functions parameterized by neural networks. To simplify, we view the holistic message-passing pipeline as an end-to-end function f_θ(·) built on graph G:

    H^(K) = f_θ(X; G),    (2)

where X is the input node feature matrix and H^(K) is the final-layer node feature matrix after K rounds of message passing. To obtain the representation h_G of the entire graph, the permutation-invariant READOUT(·) function pools node features from the final iteration K:

    h_G = READOUT({ h_v^(K) | v ∈ V }).    (3)

Additionally, from the spectral convolution point of view, the k-th layer of GCN applies the renormalization trick

    I + D^(-1/2) A D^(-1/2) → D̃^(-1/2) Ã D̃^(-1/2),    S = D̃^(-1/2) Ã D̃^(-1/2),    (4)

and computes

    H^(k+1) = σ( S H^(k) Θ^(k) ),    (5)

where H^(k) is the node feature matrix of the k-th layer with H^(0) = X, Θ^(k) is the trainable weight matrix of layer k, and σ is the activation function. D̃ and Ã denote the diagonal degree matrix and the adjacency matrix after self-loops are added, so S can be viewed as a normalized adjacency matrix with self-loops.

Adversarial training. Standard adversarial training seeks to solve the min-max problem

    min_θ E_(x,y)∼D [ max_{‖δ‖_p ≤ ε} L(f_θ(x + δ), y) ],    (6)

where D is the data distribution, y is the label, ‖·‖_p is some ℓ_p-norm distance metric, ε is the perturbation budget, and L is the objective function. Madry et al.
(2017) showed that this saddle-point optimization problem can be reliably tackled by Stochastic Gradient Descent (SGD) for the outer minimization and Projected Gradient Descent (PGD) for the inner maximization. In practice, the typical approximation of the inner maximization under an ℓ∞-norm constraint is:

    δ_{t+1} = Π_{‖δ‖∞ ≤ ε}( δ_t + α · sign(∇_δ L(f_θ(x + δ_t), y)) ),

where the perturbation δ is updated iteratively and Π_{‖δ‖∞ ≤ ε} performs projection onto the ε-ball in the ℓ∞-norm. For maximum robustness, this iterative update usually loops M times, which makes PGD computationally expensive: there are M forward and backward passes within the process, yet θ is updated just once, using the final δ_M.
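The PGD update above can be sketched in a few lines of PyTorch. Here `model` and `loss_fn` are placeholder names for any differentiable classifier and objective, and the defaults (ε = 8/255, ℓ∞-ball) follow the common image-domain convention; this is an illustrative sketch, not the paper's released code.

```python
import torch

def pgd_perturb(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10):
    """M-step l_inf PGD approximation of the inner maximization."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent step on the gradient sign
            delta.clamp_(-eps, eps)              # projection onto the eps-ball
        delta.grad.zero_()
    return delta.detach()
```

Note that θ would be updated only once afterwards, using the loss at the final perturbation, which is exactly the inefficiency the "free" methods in Section 3 remove.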

3. PROPOSED METHOD: FLAG

Adversarial training is a form of data augmentation. By hunting for and stamping out small perturbations that cause the classifier to fail, one may hope that adversarial training will also benefit standard accuracy (Goodfellow et al., 2014; Tsipras et al., 2018; Miyato et al., 2018). With an increasing amount of attention paid to leveraging adversarial training for better clean performance in varied domains (Xie et al., 2020; Zhu et al., 2019; Gan et al., 2020), we conduct the first study of how to effectively generalize GNNs using adversarial data augmentation. Here we introduce FLAG, Free Large-scale Adversarial Augmentation on Graphs, to best exploit the power of adversarial augmentation. Note that our method differs from other graph augmentations in that it operates in the input node feature space.

Augmentation for "free". We leverage the "free" adversarial training method (Shafahi et al., 2019) to craft adversarial data augmentations. PGD is a strong but inefficient way to solve the inner maximization of (6). While computing the gradient for the perturbation δ, free training simultaneously computes the gradient for the model parameters θ; this parameter gradient, obtained for free in the same backward pass, is then used to update the model. The authors proposed training on the same minibatch M times in a row to simulate the inner maximization in (6), while compensating by performing M times fewer epochs of training. The resulting algorithm yields accuracy and robustness competitive with standard adversarial training, but at the same runtime as clean training.

Gradient accumulation. In "free" adversarial training, the inner/adversarial loop is run M times, each time computing gradients for both δ_t and θ_{t-1}. Rather than updating the model parameters in each loop, Zhang et al. (2019) proposed to accumulate the gradients for θ_{t-1} during the inner loop and apply them all at once in the outer/parameter update. The same idea was used by Zhu et al.
(2019), who proposed FreeLB to tackle this optimization issue on language understanding tasks. FreeLB runs multiple PGD steps to craft adversaries while accumulating the model parameter gradients ∇_θ L. This gradient accumulation behavior can be approximated as optimizing the objective below:

    min_θ E_(x,y)∼D [ (1/M) Σ_{t=0}^{M-1} max_{δ_t ∈ I_t} L(f_θ(x + δ_t), y) ],  where I_t = B_{x+δ_0}(αt) ∩ B_x(ε).

Algorithm 1 FLAG: Free Large-scale Adversarial Augmentation on Graphs
1:  Input: graph G, node feature matrix X, labels y; ascent steps M, ascent step size α, learning rate τ
2:  for each epoch do
3:      δ_0 ← U(-α, α)                                          ▷ initialize from uniform distribution
4:      g_0 ← 0
5:      for t = 1 . . . M do
6:          g_t ← g_{t-1} + (1/M) · ∇_θ L(f_θ(X + δ_{t-1}; G), y)    ▷ θ gradient accumulation
7:          g_δ ← ∇_δ L(f_θ(X + δ_{t-1}; G), y)
8:          δ_t ← δ_{t-1} + α · g_δ / ‖g_δ‖_F                   ▷ perturbation δ gradient ascent
9:      end for
10:     θ ← θ - τ · g_M                                         ▷ model parameter θ gradient descent
11: end for

Unbounded attack. On images, the inner maximization is usually a constrained optimization problem: the largest perturbation one can add is bounded by the hyperparameter ε, typically 8/255 under the ℓ∞-norm. This encourages visual imperceptibility of the perturbations, making defenses realistic and practical. However, graph node features and language word embeddings lack such straightforward semantic meaning, which makes the selection of ε highly heuristic. In light of the positive effect of large perturbations on generalization (Volpi et al., 2018), and also to simplify the hyperparameter search, FLAG drops the projection step when performing the inner maximization. Note that, although the perturbation is not bounded by an explicit ε, it is still implicitly bounded: the furthest distance δ can travel is the step size α times the number of ascent steps M.

Biased perturbation for node classification. Conventional convolutional networks treat each test sample independently during inference, but this is not the case in transductive graph learning scenarios. When classifying one target node, messages from the whole k-hop neighborhood are aggregated and combined into its embedding.
It is natural to expect that a more distant neighbor should have a lower impact on, i.e. induce higher smoothness in, the final decision of the target node, which is also intuitively reflected by the message-passing view of GNNs in (1). To promote more invariance for further-away neighbors when doing node classification, we perturb unlabeled nodes with a larger step size α_u than the step size α_l used for target nodes. We show the effectiveness of this biased perturbation in the ablation study section. The overall augmentation pipeline is presented in Algorithm 1. Note that when doing transductive node classification, we use distinct step sizes α_l and α_u to craft adversarial augmentations for target and unlabeled nodes, respectively. In the following sections, we verify FLAG's effectiveness through extensive experiments and provide detailed discussions for a deeper understanding of the effects of adversarial augmentation.
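One step of Algorithm 1 can be sketched in roughly a dozen lines of PyTorch. This sketch uses sign(·) for gradient normalization, as our experiments do, in place of the Frobenius normalization shown in line 8; the biased α_l/α_u split and mini-batching are omitted for brevity, and `forward` stands for any GNN closure over the fixed graph G.

```python
import torch

def flag_step(forward, X, y, loss_fn, optimizer, M=3, alpha=1e-3):
    """One FLAG training step: M ascent steps on delta, one descent step on theta."""
    delta = torch.empty_like(X).uniform_(-alpha, alpha)  # delta_0 ~ U(-alpha, alpha)
    delta.requires_grad_()
    optimizer.zero_grad()
    for _ in range(M):
        loss = loss_fn(forward(X + delta), y) / M
        loss.backward()                                  # accumulates theta gradients
        with torch.no_grad():
            delta += alpha * torch.sign(delta.grad)      # unbounded ascent step
        delta.grad.zero_()
    optimizer.step()                                     # single parameter update
    return loss.item()
```

Because the θ gradients accumulate across the M inner backward passes and the optimizer steps only once, the augmentation adds no extra parameter updates to the training loop.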

4. EXPERIMENTS

In this section, we demonstrate FLAG's effectiveness through extensive experiments on the Open Graph Benchmark (OGB), which consists of a wide range of challenging large-scale datasets. Shchur et al. (2018); Errica et al. (2019); Dwivedi et al. (2020) showed that traditional graph datasets suffer from problems such as unrealistic and arbitrary data splits, highly limited data sizes, non-rigorous evaluation metrics, and common neglect of cross-validation. To study FLAG's effects in a fair and reliable manner, we conduct experiments on the newly released OGB datasets (Hu et al., 2020), which tackle those major issues and bring more realistic challenges to the graph research community. We refer readers to Hu et al. (2020) for detailed information on the OGB datasets.

Unless otherwise stated, all baseline test statistics come from the official OGB leaderboard, and we conduct all of our experiments using publicly released implementations without touching the original model architecture or training setup. We report mean and standard deviation over ten runs with different random seeds. Following common practice on this benchmark, we report the test performance associated with the best validation result. We choose the widely used GCN, GraphSAGE, GAT, and GIN as our baseline models. In addition, we apply FLAG to the recent DeeperGCN model to demonstrate effectiveness. Our implementation always uses M = 3 ascent steps for simplicity. Following Goodfellow et al. (2014); Madry et al. (2017), we use sign(·) for gradient normalization. We leave exhaustive hyperparameter and normalization search to future research. All training hyperparameters and evaluation results can be found in the Appendix.

Node Property Prediction. We summarize the results of node classification in Table 1. On ogbn-products, GraphSAGE, GAT, and DeeperGCN all receive promising boosts from FLAG.
We adopt neighbor sampling (Hamilton et al., 2017) as the mini-batch algorithm for GraphSAGE and GAT to keep the experiments scalable. For DeeperGCN, we follow the original setup of Li et al. (2020) and randomly split the graph into clusters. Notably, FLAG yields a 2.31% test accuracy lift for GAT, making GAT competitive on the ogbn-products dataset. Because the ogbn-proteins graph is small, all models are trained in a full-batch manner. From Table 1 we see that FLAG further enhances the performance of DeeperGCN but harms that of GCN and GraphSAGE. Considering this dataset's peculiarity of having no input node features, we provide detailed discussions on the effect of different node feature constructions later. We also use full-batch training on ogbn-arxiv, where FLAG enables GAT and DeeperGCN to reach 73.71% and 72.14% accuracy, respectively. Note that the GAT baseline is from the DGL (Wang et al., 2019) implementation, which differs from vanilla GAT in that batch norm and label propagation are incorporated. We examine batch norm's influence in the discussion. ogbn-mag is a heterogeneous network where only "paper" nodes come with node features. We use the neighbor sampling mini-batch algorithm to train R-GCN and report its results in the right part of Table 2. Surprisingly, FLAG directly brings a nontrivial accuracy improvement without any special design for heterogeneous graphs, which demonstrates its versatility.

Graph Property Prediction. Table 3 summarizes the test scores of GCN, GIN, and DeeperGCN on all four OGB graph property prediction datasets. "Virtual" means the model is augmented with virtual nodes (Li et al., 2017; Gilmer et al., 2017; Hu et al., 2020). As adversarial perturbations are crafted by gradient ascent, it would be unnatural to perturb discrete input node features. Following Jin & Zhang (2019); Zhu et al. (2019), we first project discrete node features into a continuous space and then adversarially augment the hidden embeddings.
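This projection step amounts to a thin wrapper around an embedding lookup, sketched below. `PerturbedEmbedding` is a hypothetical module name chosen for illustration, not a class from the released code: discrete feature ids are first mapped to continuous vectors, and the FLAG perturbation is then added to those vectors rather than to the raw inputs.

```python
import torch

class PerturbedEmbedding(torch.nn.Module):
    """Embed discrete node features, then perturb the continuous embeddings."""
    def __init__(self, num_types, dim):
        super().__init__()
        self.emb = torch.nn.Embedding(num_types, dim)

    def forward(self, x_discrete, delta=None):
        h = self.emb(x_discrete)          # discrete ids -> continuous vectors
        return h if delta is None else h + delta
```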
On ogbg-molhiv, FLAG yields notable improvements, but when GCN has already been hurt by virtual nodes, FLAG appears to exaggerate the harm. Note that the test results on ogbg-molhiv all have relatively high variance compared with the other datasets, so randomness in the test result is more severe. On ogbg-molpcba, GIN-Virtual with FLAG receives an absolute 1.31% test AP increase, and DeeperGCN is further enhanced to retain its SOTA performance. On ogbg-ppa, FLAG further generalizes DeeperGCN and registers a new state-of-the-art test accuracy of 77.52%. On ogbg-code, FLAG boosts GCN-Virtual to a state-of-the-art test F1 score of 33.16. Beyond node classification, FLAG's strong effects on graph classification demonstrate its versatility. In most cases, FLAG works well with virtual node augmentation to further enhance graph learning.

5. ABLATION STUDIES AND DISCUSSIONS

Effects of biased perturbation. From the left part of Table 2, we see a salient increase in accuracy when using a larger perturbation on unlabeled nodes, which verifies the effectiveness of biased perturbations.

Comparison with other adversarial training methods. The right part of Table 4 shows GAT's performance with different adversarial augmentations. For PGD and Free, we compute 8 ascent steps for the inner maximization, while for FreeLB and FLAG we compute 3 steps. FLAG outperforms all other methods by a large margin.

Compatibility with mini-batch methods. Graph mini-batch algorithms are critical to training GNNs on large-scale datasets. We test how different algorithms work with adversarial data augmentation using GraphSAGE as the backbone. From the left part of Table 4, we see that both neighbor sampling (Hamilton et al., 2017) and GraphSAINT (Zeng et al., 2019) work with FLAG to further boost performance, while Cluster (Chiang et al., 2019) suffers an accuracy drop.

Compatibility with batch norm. The left part of Table 5 shows that batch norm helps GAT generalize, and FLAG pushes the improvement further. In the computer vision domain, Xie et al. (2020) proposed a new batch norm method that lets adversarial training further generalize large-scale CNN models. As there is growing attention on using batch norm in GNNs, it will be interesting to see how to synergize adversarial augmentation with batch norm in future architectures.


Compatibility with dropout. Dropout is widely used in GNNs. The right part of Table 5 shows that, when trained without dropout, GAT accuracy drops steeply. Moreover, FLAG can further generalize GNN models together with dropout, mirroring the behavior of image augmentations.

Towards going "free". FLAG introduces tractable extra training overhead. We empirically show that, when we decrease the number of training epochs to match the runtime of the standard GNN training pipeline, FLAG still brings significant performance gains: the left part of Table 2 shows that FLAG with fewer epochs still generalizes the baseline. Empirically, on a single Nvidia RTX 2080 Ti, 100-epoch vanilla GAT training takes 88 minutes, while FLAG in Table 2 takes 91 minutes. We note that heuristics like early stopping and cyclic learning rates can further accelerate adversarial training (Wong et al., 2020), so there are abundant opportunities for further research on adversarial augmentation at lower or even no cost.

Towards going deep. Over-smoothing stops GNNs from going deep. FLAG shows its ability to boost both shallow and deep baselines, e.g. GCN and DeeperGCN. In the left part of Figure 1, we show FLAG's effect on generalization as a GNN goes progressively deeper. The experiments are conducted on ogbn-arxiv with GraphSAGE as the backbone, where a consistent improvement is evident.

What if there are no node features? A natural question arises: what if no input node features are provided? ogbn-proteins is a dataset without input node features. Hu et al. (2020) proposed to average incoming edge features to obtain initial node features, while Li et al. (2020) used summation and achieved competitive results. Note that the GCN and GraphSAGE baselines in Table 1 use the "mean" node features as input and suffer an accuracy drop with FLAG, whereas DeeperGCN leverages the "sum" and is further improved.
Interestingly, when DeeperGCN is trained with "mean" node features, it exhibits high invariance: even large-magnitude perturbations do not change its predictions. This divergent behavior of adversarial augmentation underlines the importance of how initial node features are constructed.
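The two constructions differ only in whether incoming edge features are averaged or summed. A minimal sketch, assuming a COO-style `edge_index` (rows: source, target) and a hypothetical helper name:

```python
import torch

def init_node_feats(edge_index, edge_attr, num_nodes, mode="mean"):
    """Build initial node features from incoming edge features ("mean" or "sum")."""
    dst = edge_index[1]                                    # target node of each edge
    out = torch.zeros(num_nodes, edge_attr.size(1))
    out.index_add_(0, dst, edge_attr)                      # sum of incoming edge feats
    if mode == "mean":
        deg = torch.zeros(num_nodes).index_add_(0, dst, torch.ones(dst.size(0)))
        out = out / deg.clamp(min=1).unsqueeze(-1)         # avoid divide-by-zero
    return out
```

Under "mean", the feature scale is normalized per node, which may partly explain the model's insensitivity to added perturbations.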

6. WHERE DOES THE BOOST COME FROM?

It is now widely believed that model robustness appears to be at odds with clean accuracy. Despite the proliferation of literature using adversarial data augmentation to promote standard performance, it remains unsettled where the boost, or detriment, of adversarial training comes from.

Data distribution is the key. We conjecture that the diverse effects of adversarial training in different domains stem from differences in the input data distribution rather than in model architectures. To ground this claim, we use FLAG to augment MLPs (an architecture for which adversarial training has adverse effects in the image domain) on ogbn-arxiv, and successfully boost generalization: FLAG directly improves the test accuracy from 55.50 ± 0.23% to 56.02 ± 0.19%. In general, adversarial training hurts clean accuracy in image classification, yet Tsipras et al. (2018) showed that CNNs can benefit from adversarial augmentations on MNIST. This is consistent with our conjecture that model architecture has little to do with the effect of adversarial augmentation.

Like one-hot word embeddings for language models, input node features usually come from discrete spaces, e.g. bag-of-words binary features in ogbn-products. We believe that discrete versus continuous input features may lead to different adversarial augmentation behavior. We provide a simple example on the Cora (Getoor, 2005) dataset to illustrate. We choose the classic FGSM to craft the adversarial augmentation and GCN as the backbone, and simulate node features drawn from a continuous distribution by adding Gaussian noise with standard deviation δ. The result is summarized in the right part of Figure 1. When δ = 0, the discrete distribution of node features persists, and GCN with adversarial augmentation outperforms the clean model. As the noise magnitude δ increases, the features become continuously distributed with large support, and FGSM starts to harm clean accuracy, which validates our conjecture.
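The FGSM augmentation used in this experiment is a single-step special case of the PGD update in Section 2. A sketch, with `model` and `loss_fn` as placeholder names for the GCN backbone and its training objective:

```python
import torch

def fgsm_augment(model, loss_fn, x, y, alpha=0.01):
    """Single-step FGSM: move inputs along the sign of the input gradient."""
    x_adv = x.clone().detach().requires_grad_()
    loss_fn(model(x_adv), y).backward()
    return (x + alpha * x_adv.grad.sign()).detach()
```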

7. RELATED WORK

Existing graph regularizers mainly focus on augmenting graph structures by modifying edges (Rong et al., 2019; Hamilton et al., 2017; Chen et al., 2018); we instead propose to augment graph data with adversarial perturbations in the node feature space. On large-scale image classification tasks, Xie et al. (2020) leveraged adversarial perturbations, along with a new batch norm method, to augment data. Zhu et al. (2019); Jiang et al. (2019) added adversarial perturbations in the embedding space to further generalize language models in the fine-tuning phase. Gan et al. (2020) showed that VQA model accuracy can be further improved by adversarial augmentation. To clarify, FLAG is intrinsically different from previous graph adversarial training methods (Feng et al., 2019; Deng et al., 2019; Jin & Zhang, 2019). Feng et al. (2019) proposed to reinforce local smoothness to make embeddings within communities similar. All three methods assign pseudo-labels to test nodes during training and use virtual adversarial training (Miyato et al., 2018) to make test node predictions similar to their pseudo-labels. This makes them workable in semi-supervised settings, but not for inductive tasks. Besides the original classification loss, they all introduce a KL loss into the final objective, which at least doubles GPU memory usage and makes training less efficient and less scalable. In contrast, FLAG requires minimal extra space overhead and works directly in the original training setup.

8. CONCLUSION

We propose FLAG (Free Large-scale Adversarial Augmentation on Graphs), a simple, scalable, and general data augmentation method for better GNN generalization. Like widely-used image augmentations, FLAG can be easily incorporated into any GNN training pipeline. FLAG yields consistent improvement over a range of GNN baselines, and reaches state-of-the-art performance on the large-scale ogbg-molpcba, ogbg-ppa, and ogbg-code datasets. Besides extensive experiments, we also provide conceptual analysis to validate adversarial augmentation's different behavior on varied data types. The effects of adversarial augmentation on generalization are still not entirely understood, and we think this is a fertile space for future exploration. 



Figure 1: Left: Test accuracy on ogbn-arxiv. Right: Test accuracy on the Cora dataset.


Node property prediction test performance on ogbn-products, ogbn-proteins, and ogbn-arxiv datasets. Blank denotes no statistics on the leaderboard.



Graph property test performance on ogbg-molhiv, ogbg-molpcba, ogbg-ppa, and ogbg-code datasets. denotes state-of-the-art performance on the OGB leaderboard; denotes the existence of virtual nodes; blank denotes no statistics on the leaderboard.

Left: Test accuracy on ogbn-products with GraphSAGE trained with diverse mini-batch algorithms. Right: Test performance on ogbn-products with GAT trained with different adversarial augmentations.

Left: Test Accuracy on the ogbn-arxiv dataset. Right: Test Accuracy on the ogbn-products dataset.

APPENDIX

Here we summarize our main experiment results on both node and graph classification tasks. Hyperparameters for crafting adversarial augmentations are listed in the table. For other training setups of backbones, we refer readers to the public website of the OGB leaderboard. 

