IMPROVING ADVERSARIAL ROBUSTNESS BY PUTTING MORE REGULARIZATIONS ON LESS ROBUST SAMPLES

Abstract

Adversarial training, which enhances robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data that deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is that it applies more regularization to data vulnerable to adversarial attacks than other existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as minimizing a newly derived upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) simultaneously, achieving state-of-the-art performance.

1. INTRODUCTION

It is easy to generate human-imperceptible perturbations that derail the prediction of a deep neural network (DNN). Such perturbed samples are called adversarial examples (Szegedy et al., 2014), and algorithms for generating adversarial examples are called adversarial attacks. It is well known that adversarial attacks can greatly reduce the accuracy of DNNs, for example from about 96% accuracy on clean data to almost zero accuracy on adversarial examples (Madry et al., 2018). This vulnerability of DNNs can cause serious security problems when DNNs are applied to security-critical applications (Kurakin et al., 2017; Jiang et al., 2019) such as medicine (Ma et al., 2020; Finlayson et al., 2019) and autonomous driving (Kurakin et al., 2017; Deng et al., 2020; Morgulis et al., 2019; Li et al., 2020). Adversarial training, which enhances robustness against adversarial attacks, has therefore received much attention. Existing adversarial training algorithms can be categorized into two types. The first type learns prediction models by minimizing the robust risk, i.e., the risk on adversarial examples. PGD-AT (Madry et al., 2018) is the first of its kind, and various modifications, including Zhang et al. (2020), Ding et al. (2020) and Zhang et al. (2021), have been proposed since then. The second type minimizes a regularized risk, the sum of the empirical risk on clean examples and a regularization term related to adversarial robustness. TRADES (Zhang et al., 2019) decomposes the robust risk into the sum of the natural risk and the boundary risk, where the first is the risk on clean examples and the second is the remaining part, and replaces them with their upper bounds to obtain the regularized risk. HAT (Rade & Moosavi-Dezfooli, 2022) modifies the regularization term of TRADES by adding an additional regularization term based on helper samples.
The aim of this paper is to develop a new adversarial training algorithm for DNNs that is theoretically well motivated and empirically superior to existing competitors. Our algorithm modifies the regularization term of TRADES (Zhang et al., 2019) to put more regularization on less robust samples. This new regularization term is motivated by an upper bound of the boundary risk. Our proposed regularization term is similar to that of MART (Wang et al., 2020). The two key differences are that (1) the objective function of MART consists of the robust risk plus a regularization term while ours consists of the natural risk plus a regularization term, and (2) our algorithm regularizes less robust samples more, whereas MART regularizes less accurate samples more. Note that our algorithm is theoretically motivated by an upper bound of the robust risk, while no such theoretical explanation of MART is available. In numerical studies, we demonstrate that our algorithm outperforms MART as well as TRADES by large margins.

1.1. OUR CONTRIBUTIONS

We propose a new adversarial training algorithm. Compared to other existing adversarial training algorithms, it is theoretically well motivated and empirically superior. Our contributions can be summarized as follows:
• We derive an upper bound of the robust risk for multi-class classification problems.
• As a surrogate version of this upper bound, we propose a new regularized risk.
• We develop an adversarial training algorithm that learns a robust prediction model by minimizing the proposed regularized risk.
• By analyzing benchmark data sets, we show that our proposed algorithm is superior to its competitors in terms of generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) simultaneously, achieving state-of-the-art performance.
• We illustrate that our algorithm helps improve the fairness of the prediction model, in the sense that the per-class error rates become more similar than under TRADES.

2. PRELIMINARIES

2.1. ROBUST POPULATION RISK

Let X ⊂ R^d be the input space, Y = {1, ..., C} the set of output labels, and f_θ : X → R^C the score function parameterized by the neural network parameters θ (the vector of weights and biases), so that p_θ(·|x) = softmax(f_θ(x)) is the vector of conditional class probabilities. Let F_θ(x) = argmax_c [f_θ(x)]_c, let B_p(x, ε) = {x' ∈ X : ‖x − x'‖_p ≤ ε}, and let 1(·) be the indicator function. Capital letters X, Y denote random variables or vectors and small letters x, y denote their realizations. The robust population risk used in adversarial training is defined as

R_rob(θ) := E_(X,Y) [ max_{X' ∈ B_p(X, ε)} 1{F_θ(X') ≠ Y} ],   (1)

where X and Y are a random vector in X and a random variable in Y, respectively. Most adversarial training algorithms learn θ by minimizing an empirical version of (1). In turn, most empirical versions of (1) require generating an adversarial example, a surrogate of

x_adv := argmax_{x' ∈ B_p(x, ε)} 1{F_θ(x') ≠ y}.

Any method for generating an adversarial example is called an adversarial attack.
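For concreteness, the projection onto the ℓ∞ ball B_∞(x, ε) and a plug-in estimate of the robust risk (1) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code; `predict` is a stand-in for the classifier F_θ.

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Project x_adv onto the L-infinity ball B(x, eps), then clip to the
    valid input range [0, 1] (images are assumed normalized into [0, 1])."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0.0, 1.0)

def empirical_robust_risk(predict, x_adv_batch, y_batch):
    """Fraction of adversarial examples the classifier misclassifies:
    a plug-in estimate of R_rob once x_adv approximates the inner max."""
    preds = np.array([predict(x) for x in x_adv_batch])
    return float(np.mean(preds != np.asarray(y_batch)))
```

In practice the inner maximization is intractable, which is why the attacks of Section 2.2 are used as surrogates.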

2.2. ALGORITHMS FOR GENERATING ADVERSARIAL EXAMPLES

Existing adversarial attacks can be categorized into white-box attacks (Goodfellow et al., 2015; Madry et al., 2018; Carlini & Wagner, 2017; Croce & Hein, 2020a) and black-box attacks (Papernot et al., 2016; 2017; Chen et al., 2017; Ilyas et al., 2018; Papernot et al., 2018). In the white-box setting, the model structure and parameters are known to the adversary, who uses this information to generate adversarial examples; in the black-box setting, only the model's outputs for given inputs are available. The most popular white-box attack is PGD (Projected Gradient Descent) (Madry et al., 2018). Let η(x'|θ, x, y) be a surrogate loss of 1{F_θ(x') ≠ y} for given θ, x, y. PGD finds an adversarial example by applying gradient ascent to η and projecting the iterate back onto B_p(x, ε). That is, the update rule of PGD is

x^(t+1) = Π_{B_p(x,ε)} ( x^(t) + ν · sgn(∇_{x^(t)} η(x^(t)|θ, x, y)) ),   (2)

where ν > 0 is the step size, Π_{B_p(x,ε)}(·) is the projection operator onto B_p(x, ε), and x^(0) = x. We define x^pgd := lim_{t→∞} x^(t). For the surrogate loss η, the cross-entropy (Madry et al., 2018) or the KL divergence (Zhang et al., 2019) is used. For a black-box attack, the adversary generates a dataset {x̃_i, ỹ_i}_{i=1}^n, where ỹ_i is the output for a given input x̃_i, trains a substitute prediction model on this data set, and generates adversarial examples from the substitute model by PGD (Papernot et al., 2017).
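The PGD update (2) can be sketched in NumPy as below. This is a hedged sketch, not the authors' implementation: `grad_fn` is an assumed user-supplied callable returning the input gradient of the surrogate loss η (in a real attack it would come from automatic differentiation through the model), and p = ∞ is assumed.

```python
import numpy as np

def pgd_attack(x, y, grad_fn, eps=8/255, step=2/255, n_iter=10, rng=None):
    """Sign-gradient ascent with projection onto the L-infinity ball:
    x^{t+1} = Proj_{B_inf(x, eps)}( x^t + step * sign(grad eta) )."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)  # random start
    for _ in range(n_iter):
        g = grad_fn(x_adv, y)                      # gradient of surrogate loss w.r.t. input
        x_adv = x_adv + step * np.sign(g)          # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back onto the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep inputs in [0, 1]
    return x_adv
```

With a constant positive gradient the iterate saturates at the upper face of the ball, which matches the intuition that PGD pushes the example as far as the budget ε allows.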

2.3. REVIEW OF ADVERSARIAL TRAINING ALGORITHMS

We review the adversarial training algorithms that, we think, are most related to our proposed algorithm. Typically, adversarial training algorithms consist of a maximization step and a minimization step. In the maximization step, we generate adversarial examples for given θ; in the minimization step, we fix the adversarial examples and update θ. In the following, we denote by x̂_i^pgd the adversarial example corresponding to (x_i, y_i) generated by PGD.

2.3.1. ALGORITHMS MINIMIZING THE ROBUST RISK DIRECTLY

PGD-AT. Madry et al. (2018) proposes PGD-AT, which updates θ by minimizing

Σ_{i=1}^n ℓ_ce(f_θ(x̂_i^pgd), y_i),

where ℓ_ce is the cross-entropy loss.

GAIR-AT. Geometry Aware Instance Reweighted Adversarial Training (GAIR-AT) (Zhang et al., 2021) is a modification of PGD-AT that minimizes a weighted robust risk in which samples closer to the decision boundary receive larger weights. To be more specific, the weighted empirical risk of GAIR-AT is given as

Σ_{i=1}^n w_θ(x_i, y_i) ℓ_ce(f_θ(x̂_i^pgd), y_i),

where κ_θ(x_i, y_i) = min( min{t : F_θ(x̂_i^(t)) ≠ y_i}, T ) for a prespecified maximum number of iterations T, and w_θ(x_i, y_i) = (1 + tanh(5(1 − 2κ_θ(x_i, y_i)/T)))/2. There are other similar modifications of PGD-AT, including Max-Margin Adversarial (MMA) Training (Ding et al., 2020) and Friendly Adversarial Training (FAT) (Zhang et al., 2020).
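The GAIR-AT weight above is a simple scalar function of the geometric margin proxy κ; a minimal sketch (not the authors' code) makes its behavior explicit:

```python
import math

def gair_weight(kappa, T):
    """GAIR-AT instance weight (1 + tanh(5(1 - 2*kappa/T))) / 2.
    kappa is the first PGD step at which the example is misclassified
    (capped at T); small kappa = close to the decision boundary = large weight."""
    return (1.0 + math.tanh(5.0 * (1.0 - 2.0 * kappa / T))) / 2.0
```

The weight decreases monotonically from nearly 1 (κ = 0, on the boundary) to nearly 0 (κ = T, never misclassified), with w = 1/2 exactly at κ = T/2.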

2.3.2. ALGORITHMS MINIMIZING A REGULARIZED EMPIRICAL RISK

The robust risk, natural risk and boundary risk are defined by

R_rob(θ) = E_(X,Y) 1{∃X' ∈ B_p(X, ε) : F_θ(X') ≠ Y},
R_nat(θ) = E_(X,Y) 1{F_θ(X) ≠ Y},
R_bdy(θ) = E_(X,Y) 1{∃X' ∈ B_p(X, ε) : F_θ(X) ≠ F_θ(X'), F_θ(X) = Y}.

Zhang et al. (2019) shows that R_rob(θ) = R_nat(θ) + R_bdy(θ). By treating R_bdy(θ) as the regularization term, various regularized risks for adversarial training have been proposed.

TRADES. Zhang et al. (2019) proposes the following regularized empirical risk, which is a surrogate version of an upper bound of the robust risk:

Σ_{i=1}^n [ ℓ_ce(f_θ(x_i), y_i) + λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^pgd)) ].

HAT. Helper-based adversarial training (HAT) (Rade & Moosavi-Dezfooli, 2022) is a variation of TRADES in which an additional regularization term based on helper examples is added to the regularized risk. The role of the helper examples is to restrain the decision boundary from having excessive margins. HAT minimizes the following regularized empirical risk:

Σ_{i=1}^n [ ℓ_ce(f_θ(x_i), y_i) + λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^pgd)) + γ ℓ_ce(f_θ(x_i^helper), F_{θ_pre}(x̂_i^pgd)) ],

where θ_pre is the parameter of a model pre-trained only on clean examples and x_i^helper = x_i + 2(x̂_i^pgd − x_i).

MART. Misclassification Aware adveRsarial Training (MART) (Wang et al., 2020) minimizes

Σ_{i=1}^n [ ℓ_margin(f_θ(x̂_i^pgd), y_i) + λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^pgd)) · (1 − p_θ(y_i|x_i)) ],   (3)

where ℓ_margin(f_θ(x̂_i^pgd), y_i) = −log p_θ(y_i|x̂_i^pgd) − log(1 − max_{k≠y_i} p_θ(k|x̂_i^pgd)). This objective function can be regarded as a regularized robust risk, and thus MART can be considered a hybrid of PGD-AT and TRADES.
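To make the two regularized objectives concrete, a minimal NumPy sketch of the per-sample TRADES and MART losses follows. It operates directly on logits and is an illustration under our own naming, not the reference implementations of either paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q):
    """Row-wise KL(p || q) between class-probability vectors."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def trades_loss(logits_clean, logits_adv, y, lam):
    """TRADES per-sample objective: CE on clean logits + lam * KL(clean || adv)."""
    p, q = softmax(logits_clean), softmax(logits_adv)
    ce = -np.log(p[np.arange(len(y)), y])
    return ce + lam * kl_div(p, q)

def mart_loss(logits_clean, logits_adv, y, lam):
    """MART per-sample objective: margin CE on adversarial logits plus a KL
    term reweighted by (1 - p(y|x)), i.e. by clean-sample inaccuracy."""
    p, q = softmax(logits_clean), softmax(logits_adv)
    idx = np.arange(len(y))
    q_other = q.copy()
    q_other[idx, y] = -1.0                       # exclude the true class from the max
    margin_ce = -np.log(q[idx, y]) - np.log(1.0 - q_other.max(axis=-1))
    return margin_ce + lam * kl_div(p, q) * (1.0 - p[idx, y])
```

When the clean and adversarial logits coincide, both KL terms vanish and the objectives reduce to their respective supervised losses.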

3. ANTI-ROBUST WEIGHTED REGULARIZATION (AROW)

In this section, we develop a new adversarial training algorithm called Anti-Robust Weighted Regularization (ARoW), which minimizes a regularized risk. We propose a new regularization term that applies more regularization to data vulnerable to adversarial attacks than existing algorithms such as TRADES and HAT do. The new regularization term is motivated by the upper bound of the robust risk derived in the following subsection.

3.1. UPPER BOUND OF THE ROBUST RISK

In this subsection, we provide an upper bound of the robust risk for multi-class classification problems, stated in the following theorem. The proof is deferred to Appendix A.

Theorem 1. For a given score function f_θ, let z(·) be any measurable mapping from X to X satisfying

z(x) ∈ argmax_{x' ∈ B_p(x, ε)} 1(F_θ(x) ≠ F_θ(x'))

for every x ∈ X. Then we have

R_rob(θ) ≤ E_(X,Y) 1(Y ≠ F_θ(X)) + E_(X,Y) 1(F_θ(X) ≠ F_θ(z(X))) · 1{p_θ(Y|z(X)) < 1/2}.   (4)

The upper bound (4) consists of two terms: the first term is the natural risk itself, and the second term is an upper bound of the boundary risk. This bound is motivated by the upper bound derived in TRADES (Zhang et al., 2019). For binary classification problems, Zhang et al. (2019) shows that

R_rob(θ) ≤ E_(X,Y) ϕ(Y f_θ(X)) + E_X ϕ(f_θ(X) f_θ(z(X))),   (5)

where z(x) ∈ argmax_{x' ∈ B_p(x, ε)} ϕ(f_θ(x) f_θ(x')) and ϕ(·) is an upper bound of 1(· < 0). Our upper bound (4) is a modification of (5) for multi-class problems, where ϕ(·) and f_θ in (5) are replaced by 1(· < 0) and F_θ, respectively. A key difference between (4) and (5), however, is the term 1{p_θ(Y|z(X)) < 1/2} at the end of (4), which does not appear in (5). It is interesting that the upper bound in Theorem 1 becomes equal to the robust risk for binary classification problems; that is, (4) is another formulation of the robust risk in that case. This rephrased formula of the robust risk is nevertheless useful, since it yields a new learning algorithm once the indicator functions are replaced by their surrogates, as we do below.
Algorithm 1 ARoW Algorithm

Input: network f_θ, training dataset D = {(x_i, y_i) ∈ R^{d+1} : i = 1, ..., n}, learning rate η, hyperparameters (λ, α) of (6), number of epochs T, number of batches B, batch size K
Output: adversarially robust network f_θ
1: for t = 1, ..., T do
2:   for b = 1, ..., B do
3:     x̂_{t,b,k}^pgd ← argmax_{x' ∈ B_p(x_{t,b,k}, ε)} KL(p_θ(·|x_{t,b,k}) ‖ p_θ(·|x')), for x_{t,b,k} ∈ R^d, k = 1, ..., K
4:     θ ← θ − η ∇_θ ( (1/K) R_ARoW(θ; {(x_{t,b,k}, y_{t,b,k})}_{k=1}^K, λ, α) ), where R_ARoW is given in (6)
5:   end for
6: end for
7: Return f_θ

3.2. ALGORITHM

By replacing the indicator functions in Theorem 1 with smooth proxies, we propose a new regularized risk and develop the corresponding adversarial learning algorithm, called the Anti-Robust Weighted Regularization (ARoW) algorithm. The terms in (4) are replaced as follows:

• the adversarial example z(x) is replaced by x̂^pgd obtained by the PGD algorithm with the KL divergence;
• the term 1(Y ≠ F_θ(X)) is replaced by the label smoothing cross-entropy (Müller et al., 2019) ℓ_LS(f_θ(x), y) = −(y_α^LS)⊤ log p_θ(·|x) for a given α > 0, where y_α^LS = (1 − α)u_y + (α/C)1_C, u_y ∈ R^C is the one-hot vector whose y-th entry is 1, and 1_C ∈ R^C is the vector of all ones;
• the term 1(F_θ(X) ≠ F_θ(z(X))) is replaced by λ · KL(p_θ(·|X) ‖ p_θ(·|X̂^pgd)) for λ > 0;
• the term 1{p_θ(Y|z(X)) < 1/2} is replaced by its convex upper bound 2(1 − p_θ(Y|X̂^pgd)).

This yields the following regularized risk for ARoW, a smooth surrogate of the upper bound (4):

R_ARoW(θ; {(x_i, y_i)}_{i=1}^n, λ) := Σ_{i=1}^n [ ℓ_LS(f_θ(x_i), y_i) + 2λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^pgd)) · (1 − p_θ(y_i|x̂_i^pgd)) ].   (6)

Here, the regularization parameter λ > 0 controls the robustness of the trained prediction model to adversarial attacks; the regularized risk (6) can be considered a smooth surrogate of the regularized robust risk R_nat(θ) + λ R_bdy(θ). We use the label smoothing cross-entropy as a surrogate for 1(Y ≠ F_θ(X)), instead of the standard cross-entropy, to estimate the conditional class probabilities p_θ(·|x) more accurately (Müller et al., 2019). Accurate estimation of p_θ(·|x) is important since it enters the regularization term of ARoW, and it is well known that DNNs trained by minimizing the cross-entropy are poorly calibrated (Guo et al., 2017). We set α = 0.2 in our numerical studies for simplicity, although it could be tuned.
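A minimal NumPy sketch of the per-sample ARoW objective (6), operating on logits, may help fix ideas. It is an illustration under our own naming, not the authors' implementation.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def label_smoothing_ce(logits, y, alpha):
    """ell_LS: cross-entropy against (1 - alpha) * onehot(y) + alpha / C."""
    n, C = logits.shape
    target = np.full((n, C), alpha / C)
    target[np.arange(n), y] += 1.0 - alpha
    return -(target * log_softmax(logits)).sum(axis=-1)

def arow_loss(logits_clean, logits_adv, y, lam, alpha=0.2):
    """ARoW per-sample objective (6): label-smoothed CE on clean logits plus
    2*lam*KL(clean || adv) weighted by (1 - p(y|x_pgd)), so less robust
    samples (small p(y|x_pgd)) receive more regularization."""
    p = np.exp(log_softmax(logits_clean))
    q = np.exp(log_softmax(logits_adv))
    idx = np.arange(len(y))
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return label_smoothing_ce(logits_clean, y, alpha) + 2.0 * lam * kl * (1.0 - q[idx, y])
```

When the clean and adversarial logits coincide, the KL factor vanishes and the loss reduces to the label smoothing cross-entropy, matching the role of the natural-risk term in (6).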
The ARoW algorithm, which learns θ by minimizing R_ARoW(θ; {(x_i, y_i)}_{i=1}^n, λ), is summarized in Algorithm 1.

Comparison to TRADES. A key difference between the regularized risks of ARoW and TRADES is that TRADES does not have the factor (1 − p_θ(y_i|x̂_i^pgd)) at the end of (6). That is, ARoW puts more regularization on samples that are vulnerable to adversarial attacks (i.e., for which p_θ(y_i|x̂_i^pgd) is small). This factor is motivated by the tighter upper bound (4) of the robust risk and is thus expected to lead to better results; our numerical studies confirm that it does.

Comparison to MART. Although the objective function of MART (3) has no theoretical basis, it is similar to that of ARoW. There are two main differences. First, the supervised loss term of ARoW is the label smoothing loss on clean examples, whereas MART uses the margin cross-entropy loss on adversarial examples. Second, the regularization term of MART is proportional to (1 − p_θ(y|x)), while that of ARoW is proportional to (1 − p_θ(y|x̂^pgd)). In the numerical studies, we observe that ARoW outperforms MART by large margins, which would be partly because ARoW is theoretically well motivated.

4. EXPERIMENTS

In this section, we investigate the ARoW algorithm in terms of robustness and generalization by analyzing three benchmark data sets: CIFAR10 (Krizhevsky, 2009), F-MNIST (Xiao et al., 2017) and SVHN (Netzer et al., 2011). In particular, we show that ARoW is superior to existing algorithms including TRADES (Zhang et al., 2019), HAT (Rade & Moosavi-Dezfooli, 2022) and MART (Wang et al., 2020), as well as PGD-AT (Madry et al., 2018) and GAIR-AT (Zhang et al., 2021), achieving state-of-the-art performance. WideResNet-34-10 (WRN-34-10) (Zagoruyko & Komodakis, 2016) and ResNet-18 (He et al., 2016) are used for CIFAR10, while ResNet-18 is used for F-MNIST and SVHN. Experimental details are presented in Appendix B.

4.1. COMPARISON OF AROW TO TRADES, HAT AND MART

We compare ARoW to the regularization algorithms TRADES (Zhang et al., 2019), HAT (Rade & Moosavi-Dezfooli, 2022) and MART (Wang et al., 2020) described in Section 2.3.2. Table 1 shows that ARoW outperforms the other regularization algorithms across data sets and architectures in terms of both the standard and robust accuracies. The selected hyper-parameter values for the other algorithms are listed in Appendix B.2. To investigate whether ARoW dominates its competitors uniformly with respect to the regularization parameter λ, we compare the trade-off between the standard and robust accuracies of ARoW and the other regularization algorithms as λ varies. Figure 1 plots the standard accuracy on the x-axis against the robust accuracy on the y-axis for each algorithm with various values of λ. For this experiment, we use CIFAR10 and the WRN-34-10 architecture. The trade-off between the standard and robust accuracies is clearly visible (a larger regularization parameter λ yields lower standard accuracy but higher robust accuracy). Moreover, ARoW uniformly dominates TRADES, HAT and MART regardless of the choice of the regularization parameter and the adversarial attack method. Additional results on this trade-off are provided in Appendix F.1.

4.2. COMPARISON OF AROW TO PGD-AT AND GAIR-AT

We compare ARoW with PGD-AT (Madry et al., 2018) and GAIR-AT (Zhang et al., 2021), which minimize the robust risk directly. Table 2 shows that ARoW outperforms PGD-AT and GAIR-AT in terms of the standard accuracy and the robust accuracy against AutoAttack (Croce & Hein, 2020b). GAIR-AT is, however, better than ARoW against the PGD20 attack; this would be mainly because of gradient masking (Papernot et al., 2018; 2017), under which PGD fails to find effective adversarial examples. See Appendix C for details about gradient masking.

4.3. COMPARISON OF AROW TO OTHER ALGORITHMS WITH EXTRA DATA

Table 3 compares ARoW with the existing algorithms that use extra data, and shows that ARoW achieves the state-of-the-art performance when extra data are available, even though the margins over HAT are not significant. Note that ARoW has advantages beyond its high robust accuracies. For example, ARoW is easier to implement than HAT, since HAT requires a pre-trained model and additional memory. Moreover, as we will see in Section 4.5, ARoW improves fairness compared to TRADES while HAT does not.

4.4. ABLATION STUDIES

We study the following three issues: (i) the effect of label smoothing on ARoW, (ii) the role of the new regularization term in ARoW in improving robustness, and (iii) modifications of ARoW obtained by applying tools that improve existing adversarial training algorithms.

4.4.1. EFFECT OF LABEL SMOOTHING

Table 4 indicates that label smoothing is helpful not only for ARoW but also for TRADES. This would be partly because the regularization terms of ARoW and TRADES depend on the conditional class probabilities, and label smoothing is known to help calibrate the conditional class probabilities (Pereyra et al., 2017). Moreover, the results in Table 4 imply that label smoothing is not the main reason ARoW outperforms TRADES: even without label smoothing, ARoW is still superior to TRADES (even with label smoothing). Appendix F.2 presents an additional experiment assessing the effect of label smoothing on performance.

4.4.2. ROLE OF THE NEW REGULARIZATION TERM

The regularization term of ARoW puts more regularization on less robust samples, and thus we expect ARoW to improve the robustness of less robust samples the most. To confirm this conjecture, we conduct a small experiment. First, we divide the test data into four groups (least robust, less robust, robust and highly robust) according to the values of p_{θ_PGD}(y_i|x̂_i^pgd) (< 0.3, 0.3 ∼ 0.5, 0.5 ∼ 0.7 and > 0.7), where θ_PGD is the parameter learned by PGD-AT (Madry et al., 2018)¹. Then, for each group, we count how many samples become robust under ARoW and TRADES, respectively; the results are presented in Table 5. ARoW improves the robustness of the least robust samples the most compared with TRADES. We believe this improvement is due to the regularization term of ARoW, which enforces more regularization on less robust samples.

4.4.3. MODIFICATIONS OF AROW

Several tools have been developed to improve existing adversarial training algorithms, such as Adversarial Weight Perturbation (AWP) (Wu et al., 2020) and Friendly Adversarial Training (FAT) (Zhang et al., 2020). AWP is a tool for finding a flat minimum of the objective function, and FAT uses early-stopped PGD when generating adversarial examples in the training phase. Details about AWP and FAT are given in Appendix F.4. We investigate how ARoW performs when modified by these tools.
We consider two modifications of ARoW: ARoW-AWP, which searches for a flat minimum of the ARoW objective function, and ARoW-FAT, which uses early-stopped PGD in the training phase of ARoW. Table 6 compares ARoW-AWP and ARoW-FAT with TRADES-AWP and TRADES-FAT. Both AWP and FAT are helpful for ARoW and TRADES, but ARoW still outperforms TRADES by large margins even after being modified by AWP or FAT.

4.5. FAIRNESS

Xu et al. (2021) reports that TRADES (Zhang et al., 2019) increases the variation of the per-class accuracies (the accuracy in each class), which is undesirable from the viewpoint of fairness. In turn, Xu et al. (2021) proposes the Fair-Robust-Learning (FRL) algorithm to alleviate this problem. Although FRL improves fairness, its standard and robust accuracies are worse than those of TRADES. In contrast, Table 7 shows that ARoW improves fairness as well as the standard and robust accuracies compared to TRADES. This desirable property of ARoW can be partly understood as follows. The main idea of ARoW is to impose more regularization on less robust samples, and samples in less accurate classes tend to be more vulnerable to adversarial attacks. Thus, ARoW improves the robustness of samples in less accurate classes, which results in improved robustness as well as improved generalization for those classes. The class-wise accuracies are presented in Appendix G.

Table 7: Class-wise accuracy disparity for CIFAR10. We report the accuracy (ACC), the worst-class accuracy (WC-Acc) and the standard deviation of class-wise accuracies (SD) for each method.

A novel feature of ARoW is to impose more regularization on less robust samples than TRADES does. The results of our numerical experiments show that ARoW improves the standard and robust accuracies simultaneously, achieving state-of-the-art performance. In addition, ARoW enhances the fairness of the prediction model without hampering the accuracies.


When developing a computable surrogate of the upper bound of the robust risk in Theorem 1, we replaced 1(F_θ(X) ≠ F_θ(z(X))) by KL(p_θ(·|X) ‖ p_θ(·|X̂^pgd)). The KL divergence, however, is not an upper bound of the 0-1 loss, and thus our surrogate is not an upper bound of the robust risk. We employed the KL divergence surrogate to make the objective function of ARoW similar to that of TRADES. It would be worth devising an alternative surrogate for the 0-1 loss to reduce this gap between the theory and the algorithm.
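A tiny numerical check illustrates the point above: two class-probability vectors can have different argmax predictions (0-1 disagreement equal to 1) while their KL divergence is far below 1, so KL cannot upper-bound the 0-1 loss. The specific vectors below are our own illustrative choice.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Different predicted classes, yet nearly identical distributions:
p = np.array([0.51, 0.49])
q = np.array([0.49, 0.51])
```

Here argmax(p) = 0 and argmax(q) = 1, while KL(p‖q) = 0.02·log(0.51/0.49) ≈ 0.0008, far below the 0-1 loss of 1.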

B DETAILED SETTINGS FOR THE EXPERIMENTS WITH BENCHMARK DATASETS

B.1 EXPERIMENTAL SETUP

For the CIFAR10, SVHN and FMNIST datasets, input images are normalized into [0, 1]. Random crop and random horizontal flip with probability 0.5 are used for CIFAR10, while only random horizontal flip with probability 0.5 is applied for SVHN; no augmentation is used for FMNIST. For generating adversarial examples in the training phase, PGD10 with random initialization, p = ∞, ε = 8/255 and ν = 2/255 is used, where PGD_T denotes the output of the PGD algorithm (2) after T iterations. For training prediction models, SGD with momentum 0.9, weight decay 5 × 10^-4, initial learning rate 0.1 and batch size 128 is used, and the learning rate is reduced by a factor of 10 at epochs 60 and 90. Stochastic weight averaging (SWA) (Izmailov et al., 2018) is employed after epoch 50 to prevent robust overfitting (Rice et al., 2020), as Chen et al. (2021) does. For evaluating robustness in the test phase, PGD20 and AutoAttack are used as adversarial attacks, where AutoAttack consists of three white-box attacks (APGD and APGD-DLR in Croce & Hein (2020b) and FAB in Croce & Hein (2020a)) and one black-box attack (Square Attack (Andriushchenko et al., 2020)). To the best of our knowledge, AutoAttack is the strongest attack. The final model is the best model against PGD10 on the test data among those obtained within 120 epochs.
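The learning-rate schedule and the SWA checkpoint average described above can be sketched as follows; this is a simplified stand-in for the actual training loop, with function names of our own choosing.

```python
def step_lr(epoch, base_lr=0.1, milestones=(60, 90), gamma=0.1):
    """Piecewise-constant schedule from B.1: the lr drops 10x at epochs 60 and 90."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def swa_update(swa_w, w, n_averaged):
    """Running equal-weight average of checkpoints (SWA), applied after epoch 50.
    swa_w and w are flat lists of parameters; n_averaged counts prior checkpoints."""
    return [(s * n_averaged + wi) / (n_averaged + 1) for s, wi in zip(swa_w, w)]
```

In a real implementation these would typically be replaced by a framework scheduler and an averaged-model utility; the sketch only makes the schedule and the averaging rule explicit.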

B.2 HYPERPARAMETER SETTING

Dataset | Architecture | Method | λ | γ | Weight decay | α | SWA
CIFAR10 | WRN-34-10 | TRADES | | - | 5e-4 | - | o
CIFAR10 | WRN-34-10 | HAT | 4 | 0.25 | 5e-4 | - | o
CIFAR10 | WRN-34-10 | MART | 6 | - | 2e-4 | - | x
CIFAR10 | WRN-34-10 | PGD-AT | - | - | 5e-4 | - | o
CIFAR10 | WRN-34-10 | GAIR-AT | - | - | 5e-4 | - | o
CIFAR10 | WRN-34-10 | ARoW | 3 | - | 5e-4 | 0.2 | o
CIFAR10 | ResNet-18 | TRADES | 6 | - | 5e-4 | - | o
CIFAR10 | ResNet-18 | HAT | 4 | 0.5 | 5e-4 | - | o
CIFAR10 | ResNet-18 | MART | 6 | - | 5e-4 | - | x
CIFAR10 | ResNet-18 | PGD-AT | - | - | 5e-4 | - | o
CIFAR10 | ResNet-18 | GAIR-AT | - | - | 5e-4 | - | o
CIFAR10 | ResNet-18 | ARoW | 5 | - | 5e-4 | 0.2 | o
SVHN | ResNet-18 | TRADES | 6 | - | 5e-4 | - | x
SVHN | ResNet-18 | HAT | 4 | 0.5 | 5e-4 | - | x
SVHN | ResNet-18 | MART | 6 | - | 5e-4 | - | x
SVHN | ResNet-18 | PGD-AT | - | - | 5e-4 | - | x
SVHN | ResNet-18 | GAIR-AT | - | - | 5e-4 | - | x
SVHN | ResNet-18 | ARoW | 3 | - | 5e-4 | 0.2 | x
FMNIST | ResNet-18 | TRADES | 6 | - | 5e-4 | - | x
FMNIST | ResNet-18 | HAT | 5 | 0.15 | 5e-4 | - | x
FMNIST | ResNet-18 | MART | 6 | - | 5e-4 | - | x
FMNIST | ResNet-18 | PGD-AT | - | - | 5e-4 | - | x
FMNIST | ResNet-18 | GAIR-AT | - | - | 5e-4 | - | x
FMNIST | ResNet-18 | ARoW | 6 | - | 5e-4 | 0.2 | x

Table 8 presents the hyperparameters used in our experiments. Most of the hyperparameters are set to the values used in previous studies. The weight decay parameter is set to 5e-4 in most experiments, which is the commonly used value. Only for MART (Wang et al., 2020) with WRN-34-10 do we use weight decay 2e-4, as Wang et al. (2020) did, since MART works poorly with 5e-4. We use stochastic weight averaging (SWA) for CIFAR10 except for MART. Note that SWA is not used in the experiments of Wang et al. (2020), and we confirm that SWA is not helpful for MART. Effects of SWA for all methods are provided in Appendix F.3.

C CHECKING THE GRADIENT MASKING

Table 9: Comparison of GAIR-AT and ARoW. We compare the robustness of GAIR-AT (Zhang et al., 2021) and ARoW against the four attacks used in AutoAttack on CIFAR10. The results are based on WRN-34-10. We set λ = 3 for ARoW.

Gradient masking (Papernot et al., 2018; 2017) occurs when the gradient of the loss for a given non-robust datum is almost zero (i.e., ∇_x ℓ_ce(f_θ(x), y) ≈ 0). In this case, PGD cannot generate an adversarial example. We can detect the occurrence of gradient masking when a prediction model is robust to the PGD attack but not robust to attacks such as FAB (Croce & Hein, 2020a), APGD-DLR (Croce & Hein, 2020b) and Square Attack (Andriushchenko et al., 2020).
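The detection heuristic described above can be written as a one-line check; this is our own illustrative encoding of the rule, not a standard diagnostic API, and the tolerance is an assumed parameter.

```python
def gradient_masking_suspected(acc_pgd, acc_other_attacks, tol=0.05):
    """Heuristic from Appendix C: masking is suspected when a model looks
    robust to the gradient-based PGD attack but is much less robust to any
    alternative attack (e.g. FAB, APGD-DLR, Square Attack).
    acc_pgd: robust accuracy under PGD; acc_other_attacks: list of robust
    accuracies under the other attacks; tol: allowed accuracy gap."""
    return any(acc_pgd - a > tol for a in acc_other_attacks)
```

For instance, a model with 60% PGD robust accuracy but 30% Square Attack robust accuracy would be flagged, while a model whose accuracies agree across attacks would not.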


In Table 9, the robustness of GAIR-AT degrades markedly under the three attacks in AutoAttack other than APGD (Croce & Hein, 2020b), while the robustness of ARoW remains stable across the adversarial attacks. Since APGD uses the gradient of the loss, this observation implies that gradient masking occurs in GAIR-AT but not in ARoW.

We compare the standard and robust accuracies of the adversarial training algorithms with and without SWA; the results are summarized in Table 15. SWA improves the accuracies for all the algorithms except MART. Without SWA, ARoW is competitive with HAT, which is known to be the SOTA method; with SWA, ARoW dominates HAT.

For AWP, we set γ = 0.005, the value used in Wu et al. (2020), and do not use SWA, as in the original paper.

For FAT, early-stopped PGD returns x̂_i^(t_i), where t_i = min( min{t : F_θ(x̂_i^(t)) ≠ y_i} + K, T ) and T is the maximum number of PGD iterations. We propose the adversarial training algorithm ARoW-FAT by combining ARoW and early-stopped PGD. ARoW-FAT minimizes the following regularized empirical risk:

Σ_{i=1}^n [ ℓ_LS(f_θ(x_i), y_i) + 2λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^(t_i))) · (1 − p_θ(y_i|x̂_i^(t_i))) ].

In the experiments, we set K = 2, the value used in Zhang et al. (2020).

In Table 16, we present the per-class robust and standard accuracies of the prediction models trained by TRADES and ARoW. ARoW is highly effective for classes that are difficult to classify, such as Bird, Cat, Deer and Dog; for such classes, ARoW improves not only the standard accuracies but also the robust accuracies substantially. For example, for the class 'Cat', which is the most difficult class (the lowest standard accuracy for both TRADES and ARoW), ARoW improves robustness by 4.1 percentage points (26.1% → 30.2%) and generalization by 9.2 percentage points (65.9% → 75.1%) compared with TRADES.
These desirable results would be mainly due to the new regularization term in ARoW. Difficult classes are usually less robust to adversarial attacks, and by putting more regularization on less robust samples, ARoW improves the accuracies of less robust classes more.
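The early-stopping rule t_i = min( min{t : F_θ(x̂_i^(t)) ≠ y_i} + K, T ) used by FAT above is a one-liner; the sketch below is our own encoding, with the "never misclassified within T steps" case handled by returning T.

```python
def early_stop_iter(misclassified_at, K=2, T=10):
    """FAT early stopping: t_i = min(first PGD step with misclassification + K, T).
    misclassified_at is that first step, or None if the example is never
    misclassified within the T PGD iterations."""
    if misclassified_at is None:
        return T
    return min(misclassified_at + K, T)
```

So an example misclassified at step 3 is attacked for 5 steps (with K = 2), while a hard-to-fool example receives the full T steps.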



¹ We use PGD-AT instead of a standard non-robust training algorithm since all samples become least robust for a non-robust prediction model.



Figure 1: Comparison of ARoW, TRADES and HAT with varying λ. The x-axis and y-axis are the standard and robust accuracies, respectively. The robust accuracies in the left panel are against PGD20, while those in the right panel are against AutoAttack. We exclude the results of MART from the figures because its robust and standard accuracies are too low.

For improving performance on CIFAR10, Carmon et al. (2019) and Rebuffi et al. (2021) use extra unlabeled data with TRADES. Carmon et al. (2019) uses an additional subset of 500K images extracted from 80 Million Tiny Images (80M-TI), and Rebuffi et al. (2021) uses a data set of 1M synthetic samples generated by a denoising diffusion probabilistic model (DDPM) (Ho et al., 2020), along with the SiLU activation function and Exponential Moving Average (EMA). Further, Rade & Moosavi-Dezfooli (2022) shows that HAT achieves the SOTA performance with these extra data.

F.4.1 ADVERSARIAL WEIGHT PERTURBATION (AWP)

To flatten the loss landscape of the objective function of adversarial training, AWP (Wu et al., 2020) tries to find a flat minimum in the parameter space. Wu et al. (2020) proposes TRADES-AWP, which minimizes

min_θ max_{‖δ_l‖ ≤ γ‖θ_l‖} (1/n) Σ_{i=1}^n [ ℓ_ce(f_{θ+δ}(x_i), y_i) + λ · KL(p_{θ+δ}(·|x_i) ‖ p_{θ+δ}(·|x̂_i^pgd)) ],

where θ_l is the weight vector of the l-th layer and γ is the weight perturbation size. Inspired by TRADES-AWP, we propose ARoW-AWP, which minimizes

min_θ max_{‖δ_l‖ ≤ γ‖θ_l‖} (1/n) Σ_{i=1}^n [ ℓ_ce(f_{θ+δ}(x_i), y_i) + 2λ · KL(p_{θ+δ}(·|x_i) ‖ p_{θ+δ}(·|x̂_i^pgd)) · (1 − p_θ(y_i|x̂_i^pgd)) ].
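The per-layer constraint ‖δ_l‖ ≤ γ‖θ_l‖ in the inner maximization can be enforced by rescaling each layer's perturbation onto its budget; the sketch below is our own minimal illustration of that projection, not the AWP reference code.

```python
import numpy as np

def project_awp(delta_layers, theta_layers, gamma):
    """Enforce the AWP constraint ||delta_l|| <= gamma * ||theta_l|| per layer
    by rescaling any layer perturbation that exceeds its budget."""
    out = []
    for d, t in zip(delta_layers, theta_layers):
        budget = gamma * np.linalg.norm(t)
        norm = np.linalg.norm(d)
        out.append(d * (budget / norm) if norm > budget else d)
    return out
```

A full AWP step would alternate an ascent step on δ (to find the worst-case weight perturbation) with this projection, followed by a descent step on θ.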

F.4.2 FRIENDLY ADVERSARIAL TRAINING (FAT)

Zhang et al. (2020) suggests early-stopped PGD, which uses a data-adaptive number of PGD iterations when generating adversarial examples. TRADES-FAT, which uses early-stopped PGD in TRADES, minimizes

Σ_{i=1}^n [ ℓ_ce(f_θ(x_i), y_i) + λ · KL(p_θ(·|x_i) ‖ p_θ(·|x̂_i^(t_i))) ].

Comparison of ARoW, TRADES, HAT and MART. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets. Because of gradient masking (Papernot et al., 2018; 2017), PGD does not find adversarial examples well; see Appendix C for details about gradient masking.

Comparison of ARoW to PGD-AT and GAIR-AT. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets.

Comparison of ARoW to other adversarial algorithms with extra data on CIFAR10.

Comparison of TRADES and ARoW with/without label smoothing. With the WRN-28-10 architecture and the CIFAR10 dataset, we use λ = 6 for TRADES and λ = 3 for ARoW.

Role of the new regularization term in ARoW. # Rob_TRADES and # Rob_ARoW represent the numbers of samples that are robust only to TRADES (but not to ARoW) or only to ARoW (but not to TRADES), respectively. Diff. and Rate of Impro. denote (# Rob_ARoW − # Rob_TRADES) and (Diff. / # Rob_TRADES), respectively. Columns: Sample's Robustness | # Rob_TRADES | # Rob_ARoW | Diff. | Rate of Impro. (%)

Modifications of TRADES and ARoW. We use the CIFAR10 dataset and the ResNet-18 architecture. More details of the hyperparameters are provided in Appendix F.4.

In this paper, we derived an upper bound of the robust risk and developed a new adversarial training algorithm called ARoW, which minimizes a surrogate version of the derived upper bound.

Selected hyperparameters. Hyperparameters used in the numerical studies in Section 4.1 and Section 4.2.

The robustness of GAIR-AT shown in Table 2 is not because GAIR-AT is robust to adversarial attacks but because the adversarial examples obtained by PGD are close to clean samples. This claim is supported by the fact that GAIR-AT performs poorly against AutoAttack while it is still robust to other PGD-based adversarial attacks. Moreover, gradient masking for GAIR-AT has already been reported by Hitaj et al. (2021).

Selected hyperparameters. Hyperparameters used in the numerical studies in Section 4.3. We do not employ cutmix augmentation (Yun et al., 2019), as is done in Rade & Moosavi-Dezfolli (2022). Rebuffi et al. (2021) use the SiLU activation function and exponential moving average (EMA) based on TRADES. For HAT (Rade & Moosavi-Dezfolli, 2022) and ARoW, we use the SiLU activation function and EMA with decay factor 0.995, as is done in Rebuffi et al. (2021). The cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) is used with batch size 512. The final model is set to be the best model against PGD 10 on the test data among those obtained within 500 epochs.
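The EMA of the model weights used here keeps a slowly moving copy of the parameters that is updated after every optimization step. A one-line sketch with the decay factor 0.995 mentioned above (plain lists stand in for parameter tensors; this is an illustration, not the experiment code):

```python
def ema_update(ema_weights, model_weights, decay=0.995):
    """One exponential-moving-average step over model weights:
    ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]
```

With decay 0.995, the averaged model reacts slowly to individual updates, which tends to stabilize the final robust accuracy.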

Comparison of ARoW to competitors on CIFAR100. We compare ARoW to PGD-AT, TRADES, HAT and MART on CIFAR100 with the WRN-28-10 architecture. For better comparison, we conduct an additional experiment with extra data where the same architecture, PreAct-ResNet18, is used. In addition, we set the batch size to 1024, as is used in Rade & Moosavi-Dezfolli (2022). Table 12 shows that ARoW outperforms HAT in both the standard accuracy (+0.29%) and the robust accuracy (+0.11%) against AutoAttack.

Performance with extra data (Carmon et al., 2019) on CIFAR10. We take the values as reported in Rade & Moosavi-Dezfolli (2022). Table 13 presents the trade-off between the generalization and robustness accuracies of ARoW on CIFAR10 due to the choice of λ, where ResNet18 is used. The trade-off is clearly observed.

Standard and robust accuracies of ARoW on CIFAR10 for varying λ.



Standard and robust accuracies of ARoW on CIFAR10 for varying α.

Effects of SWA on CIFAR10 with WideResNet 34-10. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets. 'w/o' stands for 'without'.

Comparison of per-class robustness and generalization of TRADES and ARoW. Rob TRADES and Rob ARoW are the robust accuracies against PGD 20 of TRADES and ARoW, respectively. Stand TRADES and Stand ARoW are the standard accuracies.

ACKNOWLEDGEMENT

We have seen in Section 4.5 that ARoW improves fairness as well as accuracies. The advantage of ARoW in view of fairness is an unexpected by-product, and it would be interesting to develop a more principled way of enhancing the fairness further without hampering the accuracy.

Appendices

A PROOF OF THEOREM 1

In this section, we prove Theorem 1. For a score function $f_\theta$, write $\hat{y}_\theta(x) = \arg\max_k f_\theta(x)_k$. The following lemma provides the key inequality for the proof.

Lemma 2. For a given score function $f_\theta$, let $z(\cdot)$ be any measurable mapping from $\mathcal{X}$ to $\mathcal{X}$ satisfying $z(x) \in \arg\max_{x' \in B_p(x,\varepsilon)} \mathbb{1}\big(\hat{y}_\theta(x') \ne \hat{y}_\theta(x)\big)$ for every $x \in \mathcal{X}$. Then, for every $(x, y)$, we have

$$\mathbb{1}\big(\hat{y}_\theta(z(x)) \ne y\big) \le \mathbb{1}\big(\hat{y}_\theta(x) \ne y\big) + \mathbb{1}\big(\hat{y}_\theta(z(x)) \ne \hat{y}_\theta(x)\big)\cdot\mathbb{1}\big(\hat{y}_\theta(x) = y\big). \tag{A.1}$$

Proof. The inequality holds obviously if the left side of (A.1) is 0. Suppose the left side of (A.1) is 1, i.e., $\hat{y}_\theta(z(x)) \ne y$. If $\hat{y}_\theta(x) \ne y$, the first term on the right side of (A.1) is 1 and the inequality holds. If $\hat{y}_\theta(x) = y$, then $\hat{y}_\theta(z(x)) \ne y = \hat{y}_\theta(x)$, so the second term on the right side of (A.1) is 1 and the inequality holds. Note that if some $x' \in B_p(x,\varepsilon)$ satisfies $\hat{y}_\theta(x') \ne \hat{y}_\theta(x)$ while $\mathbb{1}\big(\hat{y}_\theta(z(x)) \ne \hat{y}_\theta(x)\big) = 0$, this contradicts the definition of $z(x)$; hence the boundary term is attained by $z(x)$. This completes the proof.

Theorem 1. For a given score function $f_\theta$, let $z(\cdot)$ be any measurable mapping from $\mathcal{X}$ to $\mathcal{X}$ satisfying $z(x) \in \arg\max_{x' \in B_p(x,\varepsilon)} \mathbb{1}\big(\hat{y}_\theta(x') \ne \hat{y}_\theta(x)\big)$ for every $x \in \mathcal{X}$. Then, taking the expectation of (A.1) over $(X, Y)$ yields the inequality (4).
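The decomposition behind Theorem 1 bounds the robust 0-1 error by the natural error plus a boundary term counted only on correctly classified clean samples. One natural pointwise form of this decomposition can be brute-force checked over all label combinations (the function below is our illustration of that reading, not the paper's verbatim statement):

```python
def robust_error_bound_holds(pred_clean, pred_adv, label):
    """Pointwise 0-1 decomposition of the robust error: the adversarial
    example is misclassified only if the clean sample is already
    misclassified, or the prediction flips on a correctly classified
    clean sample (the 'boundary' event)."""
    lhs = int(pred_adv != label)
    rhs = (int(pred_clean != label)
           + int(pred_adv != pred_clean) * int(pred_clean == label))
    return lhs <= rhs
```

Taking expectations of both sides of this pointwise bound recovers a "natural risk + boundary risk" upper bound on the robust risk.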

