REVISITING INSTANCE-REWEIGHTED ADVERSARIAL TRAINING

Abstract

Instance-reweighted adversarial training (IRAT) is a type of adversarial training that assigns large weights to high-importance examples and then minimizes the weighted loss. The importance is often based on the margin between each example and the decision boundary. In particular, IRAT can alleviate robust overfitting and obtain excellent robustness by computing margins with estimated probabilities. However, previous works implicitly dealt with binary classification even in the multi-class case, because they computed margins with only the true class and the most confusing class. The computed margins can become equal even for examples with different true-class probabilities, because of the complex decision boundaries in multi-class classification. In this paper, we first clarify the above problem with a specific example. Then, we propose margin reweighting, which transforms the previous margins into appropriate representations for multi-class classification by leveraging the relations between the most confusing class and the other classes. Experimental results on the CIFAR-10/100 datasets demonstrate that the proposed method is effective in boosting the robustness against several attacks as compared to the previous methods.

1. INTRODUCTION

While convolutional neural networks (CNNs) achieve excellent performance on various tasks, they are vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), i.e., maliciously perturbed inputs. Such perturbations threaten CNN-based AI systems (e.g., those for autonomous driving or medical diagnosis) because they are imperceptible to humans. Thus, various approaches have been proposed to mitigate the negative impact of perturbations (Papernot et al., 2016; Samangouei et al., 2018; Xu et al., 2018; Madry et al., 2018). Among them, adversarial training (AT) (Madry et al., 2018) is well known as an attractive defense strategy because of its clarity and efficacy. Instead of using benign examples, AT trains on adversarial examples generated by projected gradient descent (PGD) (Madry et al., 2018). Although AT can achieve excellent robustness, it can also exhibit performance degradation on benign examples and robust overfitting (Zhang et al., 2019; Rice et al., 2020).

Instance-reweighted adversarial training (IRAT) (Zeng et al., 2021; Kim et al., 2021; Zhang et al., 2021; Wang et al., 2021; Gao et al., 2021) is an effective method among the many approaches developed to overcome these issues. IRAT computes the margin between the decision boundary and each example as its importance, which is transformed into a weight with a nonlinear increasing function. Then, it minimizes the weighted classification loss by assigning these weights to each example. Geometry-aware instance-reweighted adversarial training (GAIRAT), proposed by Zhang et al. (2021), determines the importance of each example by the least number of PGD steps (LPS). GAIRAT represents the margin in the input space, because the LPS is the number of steps needed for an adversarial example, starting from a benign example, to cross the decision boundary. Smaller-margin examples are closer to the decision boundaries and are assigned larger weights.
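To make the LPS-based weighting concrete, the following is a minimal sketch of a GAIRAT-style weight assignment. The tanh-based form and its constants follow our reading of the GAIRAT paper, and the bias hyperparameter `lam` is an assumption for illustration, not a definition taken from this paper:

```python
import math

def gairat_weight(lps, max_steps, lam=-1.0):
    # Normalized LPS: small kappa means the example crosses a decision
    # boundary within few PGD steps, i.e., it has a small input-space margin.
    kappa = lps / max_steps
    # tanh-based reweighting: smaller kappa -> larger weight in (0, 1).
    return 0.5 * (1.0 + math.tanh(lam + 5.0 * (1.0 - 2.0 * kappa)))
```

An example that crosses a boundary in 2 of 10 steps thus receives a much larger weight than one requiring 9 of 10 steps, matching the intuition that boundary-adjacent examples matter more.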
Although GAIRAT achieves better robustness than standard AT, it is vulnerable to attacks other than PGD because it defines the importance in terms of the LPS. Meanwhile, margin-aware instance reweighting learning (MAIL) (Wang et al., 2021) successfully overcomes this weakness of GAIRAT by defining the importance with estimated probabilities. Specifically, it transforms the difference between two probabilities, i.e., those of the true class and the most confusing class, into a weight with a nonlinear increasing function. Weighted minimax risk (WMMR) (Zeng et al., 2021) uses the same approach. These methods differ in their weighting functions, and MAIL achieves better performance than WMMR. Unlike GAIRAT with its discrete representation of weights, MAIL alleviates robust overfitting by using continuous weights. However, MAIL and WMMR share a problem: the importance cannot be adequately represented in multi-class classification, because only the most confusing class and the true class are considered. As shown in Fig. 1(b), this issue occurs in previous approaches for instances that have the same margin. Intuitively, among examples with the same margin, such as x_1 and x_2, the importance should be larger closer to the intersection of the decision boundaries of multiple classes, but previous methods neglect this. In this paper, to resolve this issue, we first reveal the problem with the previous margin computation through a specific example. Then, as illustrated in Fig. 1(c), we propose margin reweighting, which transforms the previous margins into an appropriate representation by considering classes other than the most confusing class and the true class.
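The probability-based margin used by MAIL and WMMR can be sketched as follows. The sigmoid weighting and the hyperparameter names `gamma` and `beta` are assumptions for illustration, since the exact weighting functions are defined in the respective papers:

```python
import numpy as np

def probabilistic_margin(probs, label):
    # Difference between the true-class probability and the largest
    # remaining (most confusing) class probability; probs sums to 1.
    p_true = probs[label]
    p_confusing = np.max(np.delete(probs, label))
    return p_true - p_confusing  # in [-1, 1]; small or negative near a boundary

def sigmoid_weight(margin, gamma=10.0, beta=0.0):
    # Continuous, nonlinearly decreasing in the margin: boundary-adjacent
    # (small-margin) examples receive larger weights.
    return 1.0 / (1.0 + np.exp(gamma * (margin - beta)))
```

A borderline example (e.g., probabilities (0.4, 0.35, 0.25), margin 0.05) then receives a larger weight than a confidently correct one (e.g., (0.7, 0.2, 0.1), margin 0.5), and the weights vary continuously rather than in discrete steps as in GAIRAT.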
A straightforward approach is to compute the margins between the true class and every other class (i.e., a multi-class margin), but it is hard to design a weighting function that aggregates such a multi-class margin into a single weight. Thus, we propose a novel metric: the ratio of the top-2 probabilities within the incorrect rate (i.e., the sum of all probabilities except that of the true class). Assuming that each class probability well represents an example's relation to the center of that class, this metric can identify whether an example is close to an intersection of multi-class boundaries. Therefore, we do not have to design a special weighting function; we can obtain appropriate representations simply by multiplying the previous margins by this measure. We performed experiments demonstrating that the proposed method can boost the performance of the previous methods against several attacks. In summary, our work makes the following contributions:

• We clarify that the previous approach of computing the margin with predicted probabilities is insufficient. Specifically, we show a case in which the same margin is computed for certain examples even though both their true-class and most-confusing-class probabilities differ.

• We propose margin reweighting, which transforms the previous margins into appropriate ones by leveraging a relation between the most confusing class probability and the incorrect rate.

• We experimentally show that our approach is effective for boosting the robustness against adversarial attacks.
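One plausible reading of this measure can be sketched as follows; the direction of the ratio (largest over second-largest incorrect-class probability) and its exact combination with the margin are our assumptions here, since the paper defines them formally in a later section:

```python
import numpy as np

def reweighted_margin(probs, label):
    # Two largest incorrect-class probabilities and the previous
    # (MAIL-style) margin: true-class minus most confusing probability.
    incorrect = np.delete(probs, label)
    top1, top2 = np.sort(incorrect)[::-1][:2]
    margin = probs[label] - top1
    # Near an intersection of several decision boundaries the two largest
    # incorrect probabilities are close (ratio ~ 1), so the margin stays
    # small and the example keeps a large weight; far from intersections
    # the ratio grows and the margin is inflated, lowering the weight.
    return margin * (top1 / top2)
```

Under this reading, two examples with the same previous margin of 0.1, e.g., probabilities (0.4, 0.3, 0.3) versus (0.5, 0.4, 0.1) for true class 0, receive reweighted margins of 0.1 and 0.4, respectively, so the example near the intersection stays the more important of the two.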

2. PRELIMINARIES AND RELATED WORKS

In this section, we give an overview of standard training and adversarial training (AT), and then we describe related works on instance-reweighted adversarial training (IRAT).

2.1. STANDARD TRAINING VS. ADVERSARIAL TRAINING

Standard training: Let D = {(x_i, y_i)}_{i=1}^n be a training dataset, where x_i ∈ R^{c×h×w} is an input example and y_i is a ground-truth label. In standard training, a deep neural network f_θ : R^{c×h×w} → R^K parameterized by θ minimizes the loss ℓ(f_θ(x_i), y_i):

min_θ E_{(x_i, y_i)∼D} [ℓ(f_θ(x_i), y_i)],

where K is the number of classes and the loss function ℓ(·) is the cross-entropy loss.

Adversarial training: AT (Madry et al., 2018) aims to obtain a model robust to adversarial attacks by training on computed adversarial examples. In fact, this framework was used in fields other than image classification before deep learning became popular (Dalvi et al., 2004; Lowd & Meek, 2005). Unlike standard training, AT consists of two processes: inner maximization and outer minimization. First, the inner maximization computes a perturbation δ_i that maximizes the loss within a radius ϵ centered on x_i. Then, the outer minimization updates the weight parameters θ to minimize the loss on the adversarial examples x̃_i = x_i + δ_i obtained by the inner maximization.
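The inner maximization above is typically approximated with PGD. The following is a minimal L∞ sketch, assuming a hypothetical `grad_fn(x_adv, y)` callable that returns the loss gradient with respect to the input (a stand-in for a real model):

```python
import numpy as np

def pgd_attack(x, y, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    # Random start inside the eps-ball, then iterated ascent steps on the
    # sign of the input gradient, projecting back after every step.
    x_adv = np.clip(x + np.random.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)            # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                    # valid pixel range
    return x_adv
```

The outer minimization then takes a standard gradient step on the loss evaluated at the returned adversarial examples.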

