REVISITING INSTANCE-REWEIGHTED ADVERSARIAL TRAINING

Abstract

Instance-reweighted adversarial training (IRAT) is a type of adversarial training that assigns large weights to high-importance examples and then minimizes the weighted loss. The importance is often defined by the margin between the decision boundary and each example. In particular, IRAT can alleviate robust overfitting and obtain excellent robustness by computing margins with estimated probabilities. However, previous works implicitly dealt with binary classification even in multi-class cases, because they computed margins using only the true class and the most confusing class. Owing to the complex decision boundaries of multi-class classification, the computed margins can be equal even for examples with different true probabilities. In this paper, we first clarify the above problem with a specific example. We then propose margin reweighting, which transforms the previous margins into appropriate representations for multi-class classification by leveraging the relations between the most confusing class and the other classes. Experimental results on the CIFAR-10/100 datasets demonstrate that the proposed method is effective in boosting robustness against several attacks compared with previous methods.

1. INTRODUCTION

While convolutional neural networks (CNNs) achieve excellent performance on various tasks, they are vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), i.e., maliciously perturbed inputs. Such perturbations threaten CNN-based AI systems (e.g., those for autonomous driving or medical diagnosis) because they are imperceptible to humans. Various effective approaches have therefore been proposed to mitigate their negative impact (Papernot et al., 2016; Samangouei et al., 2018; Xu et al., 2018; Madry et al., 2018). Among them, adversarial training (AT) (Madry et al., 2018) is well known as an attractive defense strategy because of its clarity and efficacy. Instead of using benign examples, AT trains on adversarial examples generated by projected gradient descent (PGD) (Madry et al., 2018). Although AT can achieve excellent robustness, it can also exhibit performance degradation on benign examples as well as robust overfitting (Zhang et al., 2019; Rice et al., 2020). Instance-reweighted adversarial training (IRAT) (Zeng et al., 2021; Kim et al., 2021; Zhang et al., 2021; Wang et al., 2021; Gao et al., 2021) is an effective method among the many approaches developed to overcome these issues. IRAT computes the margin between the decision boundary and each example as its importance, which is transformed into a weight by a nonlinear increasing function; it then minimizes the classification loss weighted by these per-example weights. Geometry-aware instance-reweighted adversarial training (GAIRAT) (Zhang et al., 2021) determines the importance of each example by its least PGD steps (LPS). Because the LPS is the number of PGD steps required to push a benign example across the decision boundary, it represents a margin in the input space. Smaller-margin examples are closer to the decision boundary and are assigned larger weights.
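The LPS idea can be sketched in a few lines of PyTorch. The code below is a minimal illustration, not GAIRAT's reference implementation: `least_pgd_steps` is a hypothetical helper name, and the step size, radius, and step budget are illustrative constants.

```python
import torch
import torch.nn.functional as F

def least_pgd_steps(model, x, y, eps=8/255, alpha=2/255, max_steps=10):
    """Return, for each example, the least number of PGD steps needed to
    flip the model's prediction (an LPS-style input-space margin).
    Examples never misclassified within max_steps are assigned max_steps."""
    x_adv = x.clone().detach()
    lps = torch.full((x.size(0),), max_steps)
    flipped = torch.zeros(x.size(0), dtype=torch.bool)
    for step in range(1, max_steps + 1):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # One PGD ascent step, then project back to the L-inf ball.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)
            x_adv = x_adv.clamp(0.0, 1.0)
            # Record the first step at which each example is misclassified.
            now_wrong = model(x_adv).argmax(dim=1) != y
            lps[now_wrong & ~flipped] = step
            flipped |= now_wrong
    return lps
```

A small LPS (the example flips after few steps) marks a small margin, so IRAT-style schemes would assign that example a large weight.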
Although GAIRAT achieves better robustness than standard AT, it is vulnerable to attacks other than PGD because it defines importance in terms of the LPS. Meanwhile, margin-aware instance reweighting learning (MAIL) (Wang et al., 2021) overcomes this weakness of GAIRAT by defining importance with estimated probabilities. Specifically, it transforms the difference between two estimated probabilities, i.e., those of the true class and the most confusing class, into a weight.
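The probability margin described above can be sketched as follows. This is a simplified illustration of the MAIL-style margin, not MAIL's exact weighting scheme: `probability_margin` and `sigmoid_weights` are hypothetical helper names, and the sigmoid constants are illustrative.

```python
import torch
import torch.nn.functional as F

def probability_margin(logits, y):
    """Margin between the estimated probability of the true class and
    that of the most confusing (highest-scoring other) class.
    Negative margins indicate misclassified examples."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = probs.clone()
    masked.scatter_(1, y.unsqueeze(1), -1.0)  # exclude the true class
    p_confusing = masked.max(dim=1).values
    return p_true - p_confusing

def sigmoid_weights(margin, gamma=10.0, beta=0.0):
    """Nonlinear decreasing transform of the margin: examples closer to
    the boundary (smaller margin) receive larger weights."""
    w = torch.sigmoid(-gamma * (margin - beta))
    return w * margin.numel() / w.sum()  # normalize to mean 1
```

Because the margin uses probabilities from the current model rather than PGD step counts, the resulting weights are attack-agnostic, which is what lets MAIL retain robustness against attacks other than PGD.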

