GEOMETRY-AWARE INSTANCE-REWEIGHTED ADVERSARIAL TRAINING

Abstract

In adversarial machine learning, there was a common belief that robustness and accuracy hurt each other. This belief was challenged by recent studies in which the robustness is maintained while the accuracy is improved. However, the other direction, i.e., keeping the accuracy while improving the robustness, is conceptually and practically more interesting, since robust accuracy should be lower than standard accuracy for any model. In this paper, we show that this direction is also promising. Firstly, we find that even over-parameterized deep networks may still have insufficient model capacity, because adversarial training has an overwhelming smoothing effect. Secondly, given limited model capacity, we argue that adversarial data should have unequal importance: geometrically speaking, a natural data point closer to/farther from the class boundary is less/more robust, and its adversarial variant should be assigned a larger/smaller weight. Finally, to implement this idea, we propose geometry-aware instance-reweighted adversarial training (GAIRAT), where the weights are based on how difficult it is to attack a natural data point. Experiments show that our proposal boosts the robustness of standard adversarial training; combining the two directions, we improve both the robustness and the accuracy of standard adversarial training.

1. INTRODUCTION

Crafted adversarial data can easily fool standard-trained deep models by adding human-imperceptible noise to natural data, which raises security issues in applications such as medicine, finance, and autonomous driving (Szegedy et al., 2014; Nguyen et al., 2015). To mitigate this issue, many adversarial training methods employ the most adversarial data, i.e., the data maximizing the loss, for updating the current model; examples include standard adversarial training (AT) (Madry et al., 2018), TRADES (Zhang et al., 2019), robust self-training (RST) (Carmon et al., 2019), and MART (Wang et al., 2020b). These adversarial training methods seek to train an adversarially robust deep model whose predictions are locally invariant in a small neighborhood of each input (Papernot et al., 2016). By leveraging adversarial data to smooth this small neighborhood, adversarial training methods acquire robustness against adversarial data but often suffer an undesirable degradation of standard accuracy on natural data (Madry et al., 2018; Zhang et al., 2019). Thus, there have been debates on whether there exists a trade-off between robustness and accuracy. For example, some argued that the trade-off is inevitable (Tsipras et al., 2019). Recently, emerging adversarial training methods have empirically challenged this trade-off. For example, Zhang et al. (2020b) proposed the friendly adversarial training method (FAT), which employs friendly adversarial data that minimize the loss given that some wrongly-predicted adversarial data have been found, and Yang et al. (2020) introduced dropout (Srivastava et al., 2014) into the existing AT, RST, and TRADES methods. Both methods can improve the accuracy while maintaining the robustness. However, the other direction, whether we can improve the robustness while keeping the accuracy, remains unsolved and is more interesting. In this paper, we show that this direction is also achievable. Firstly, we show that over-parameterized deep networks may still have insufficient model capacity, because adversarial training has an overwhelming smoothing effect. Fitting adversarial data demands tremendous model capacity: it requires a large number of trainable parameters or long-enough training epochs to reach near-zero error on the adversarial training data (see Figure 2).
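The most adversarial data mentioned above are typically generated by projected gradient descent (PGD): repeated gradient-ascent steps on the loss, projected back into an ℓ∞-ball of radius ε around the natural input. As a minimal pure-Python sketch on a hypothetical one-dimensional logistic model (the model and all names are illustrative, not the paper's implementation):

```python
import math

def pgd_linf(x, y, w, b, eps, alpha, steps):
    """PGD on a toy 1-D logistic model p = sigmoid(w*x + b).

    Maximizes the cross-entropy loss of label y in {0, 1} within the
    l-infinity ball of radius eps around the natural input x.
    """
    x_adv = x
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-(w * x_adv + b)))
        grad = (p - y) * w                               # d(loss)/d(x_adv)
        x_adv = x_adv + alpha * (1 if grad > 0 else -1)  # signed ascent step
        x_adv = min(max(x_adv, x - eps), x + eps)        # project into eps-ball
    return x_adv

# A positive example is pushed toward the decision boundary until the
# perturbation budget is exhausted at the edge of the eps-ball.
x_adv = pgd_linf(x=1.0, y=1, w=2.0, b=0.0, eps=0.5, alpha=0.2, steps=10)  # -> 0.5
```

With enough steps, the perturbation saturates at the ball's boundary; this is the "most adversarial" variant that AT-style methods feed back for the model update.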
The over-parameterized models that fit natural data entirely in standard training (Zhang et al., 2017) are still far from enough for fitting adversarial data. Compared with standard training, which fits the natural data points themselves, adversarial training smooths the neighborhoods of natural data, so that adversarial data consume significantly more model capacity than natural data. Thus, adversarial training methods should carefully utilize the limited model capacity to fit the neighborhoods of the important data that help fine-tune the decision boundary, and it may be unwise to give equal weights to all adversarial data.

Secondly, data and their adversarial variants are not equally important. Some data are geometrically far away from the class boundary; they are relatively guarded, and their adversarial variants are hard to misclassify. Other data are close to the class boundary; they are relatively attackable, and their adversarial variants are easily misclassified (see Figure 3). As adversarial training progresses, the increasingly robust model engenders a growing number of guarded training data and a shrinking number of attackable training data. Given limited model capacity, treating all data equally may cause the vast number of adversarial variants of the guarded data to overwhelm the model, leading to undesirable robust overfitting (Rice et al., 2020). Thus, it may be pessimistic to treat all data equally in adversarial training.

To ameliorate this pessimism, we propose a heuristic method, i.e., geometry-aware instance-reweighted adversarial training (GAIRAT). As shown in Figure 1, GAIRAT treats data differently: for updating the current model, GAIRAT gives a larger/smaller weight to the loss of the adversarial variant of an attackable/guarded data point, which is more/less important in fine-tuning the decision boundary.
An attackable/guarded data point has a small/large geometric distance, i.e., distance from the decision boundary. We approximate this geometric distance by the least number of iterations κ that the projected gradient descent (PGD) method (Madry et al., 2018) requires to generate a misclassified adversarial variant (see the details in Section 3.3). GAIRAT explicitly assigns an instance-dependent weight to the loss of the adversarial variant based on the least iteration number κ.

Our contributions are as follows. (a) In adversarial training, we identify the pessimism in treating all data equally, which is due to the insufficient model capacity and the unequal nature of different data (Section 3.1). (b) We propose a new adversarial training method, i.e., GAIRAT (its learning objective in Section 3.2 and its realization in Section 3.3). GAIRAT is a general method: besides standard AT (Madry et al., 2018), existing adversarial training methods such as FAT (Zhang et al., 2020b) and TRADES (Zhang et al., 2019) can be modified to GAIR-FAT and GAIR-TRADES (Appendices B.1 and B.2, respectively). (c) Empirically, our GAIRAT can relieve the issue of robust overfitting (Rice et al., 2020).
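The κ-based distance approximation and the instance-dependent reweighting above can be sketched as follows. The callback, the tanh-based weighting form, and the parameter `lam` are illustrative choices for a decreasing function of κ, not necessarily the exact ones used in the paper:

```python
import math

def least_iterations(misclassified_after, K):
    """Approximate the geometric distance of a natural data point by kappa:
    the least number of PGD steps after which its adversarial variant is
    misclassified. misclassified_after(t) -> bool is a hypothetical
    callback; a point surviving all K steps (fully guarded) gets kappa = K.
    """
    for t in range(K + 1):
        if misclassified_after(t):
            return t
    return K

def instance_weight(kappa, K, lam=-1.0):
    """A decreasing function of kappa: attackable data (small kappa) get
    large weights, guarded data (large kappa) get small ones. lam shifts
    how sharply guarded data are down-weighted."""
    return (1.0 + math.tanh(lam + 5.0 * (1.0 - 2.0 * kappa / K))) / 2.0

def weighted_loss(losses, kappas, K):
    """Reweight per-instance adversarial losses, normalizing the weights
    so they sum to one over the batch."""
    ws = [instance_weight(k, K) for k in kappas]
    s = sum(ws)
    return sum(w / s * l for w, l in zip(ws, losses))
```

In this sketch, an instance misclassified after 1 PGD step receives a weight near 1, while one surviving all K steps receives a weight near 0; the normalized weighted loss then replaces the uniform average used in standard AT.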



Figure 1: The illustration of GAIRAT. GAIRAT explicitly gives larger weights to the losses of adversarial data (larger red), whose natural counterparts are closer to the decision boundary (lighter blue), and smaller weights to the losses of adversarial data (smaller red), whose natural counterparts are farther away from the decision boundary (darker blue). Examples on two toy datasets and the CIFAR-10 dataset are given in Figure 3.

